Abstract

Universal hash functions that exhibit $c \log n$-wise independence are shown to give a performance
in double hashing, and in virtually any reasonable generalization of double hashing, that has an expected
probe count of $\frac{1}{1-\alpha} + \epsilon$ for the insertion of the $\alpha n$-th item into a table of size $n$, for any fixed
$\alpha < 1$ and $\epsilon > 0$. This performance is within $\epsilon$ of optimal. These results are derived from a novel
formulation  that  overestimates  the  expected  probe  count  by  underestimating  the  presence  of  partial 
items  already  inserted  into  the  hash  table,  and  from  a  sharp  analysis  of  the  underlying  stochastic 
structures  formed  by  colliding  items. 
Summary

This  paper  gives  the  first  performance  bounds  for  classical  closed  hashing  schemes  in  the  case  of 
limited randomness, and thereby provides the first randomized performance analysis of these
algorithms in a model that supports programmable computation. In contrast, traditional analyses
have  relied  upon  the  use  of  mathematically  random  hash  functions  or  the  assumption  that  the 
input  data  is  completely  random.  Unfortunately,  real  data  is  seldom  provably  random,  and  the 
average  program  size  of  a  random  hash  function  is  so  large  it  exceeds  the  size  of  the  database  it  is 
intended to support. The bounds for limited randomness establish near-optimal randomized
performance for classical double hashing when restricted to programmable hash functions. Moreover, the
proof  technique  unifies  and  significantly  generalizes  previous  results  even  in  the  case  of  unlimited 
randomness. 

Let $D = (x_1, x_2, \ldots, x_{\alpha n})$ be a sequence of $\alpha n$ distinct search keys, for $\alpha < 1$, belonging to
the universe $U = \{0, 1, \ldots, m-1\}$. The objective is to hash $D$ into a search table of $n$ locations
without the use of pointers and without relocating placed items. We show that for any fixed load
$\alpha < 1$, universal classes of $c \log n$-wise independent hash functions yield the same expected probe
performance as fully random hash functions for double hashing, with an error of $\epsilon$. That is,
the expected number of probes to insert the $\alpha n$-th item is $\frac{1}{1-\alpha} + \epsilon$ when $c \log n$-wise independent
functions are used instead of idealized mathematically random functions. The positive constant $c$
depends on $\alpha$ and $\epsilon$, but not $n$. The error $\epsilon$ can be any fixed positive quantity. A consequence is
that $O(\log\log m + \log n)$ random bits suffice for these hash schemes. Moreover, subsequent work,
which builds upon the theorems and lemmas of this paper, has reduced the error from $\epsilon$ to an
optimal $O(\frac{1}{n})$ [17].

Our  performance  bound  for  double  hashing  readily  applies  to  any  generalization  that  exhibits 
approximate  pairwise  independence  for  the  first  O(logn)  probes  of  any  item,  features  statistically 
independent  probe  functions  for  any  O(logn)  items,  and  is  robust  in  the  sense  that  insertions  must 
eventually  succeed,  provided  the  table  is  not  full,  and  that  probe  locations  cannot  be  revisited  too 
often,  during  the  insertion  of  an  individual  key.  In  these  cases,  the  expected  probe  count  to  locate 
the $\alpha n$-th item is again bounded by $\frac{1}{1-\alpha} + \epsilon$. The performance bound for these generalizations is
new  even  in  the  case  of  full  randomness. 

When  combined  with  the  highly  independent  fast  hash  functions  of  [16],  these  results  give  the 
first randomized classical closed hashing schemes featuring, for a word model of computation, a
constant  number  of  arithmetic  operations  per  probe  and  nearly  optimal  probe  performance. 

These  results  are  derived  from  a  novel  formulation  that  overestimates  the  expected  probe  count 
by  underestimating  the  presence  of  local  items  already  inserted  into  the  hash  table,  and  from  a 
sharp  analysis  of  the  underlying  stochastic  structures  formed  by  colliding  items. 

1.0  Introduction  and  background 

Let $D = (x_1, x_2, \ldots, x_{\alpha n})$ be a sequence of $\alpha n$ distinct search keys, $\alpha < 1$, belonging to the
universe $U = \{0, 1, \ldots, m-1\}$. We wish to hash $D$ into a table $L[0..n-1]$. In closed hashing, which
is also called open addressing, all data must be placed within the hash table, and pointers will not be
allowed. In this model, each key $x \in U$ is mapped into a probe sequence $p(x,1), p(x,2), \ldots \in [0, n-1]$
(which ideally would be a permutation of $[0, n-1]$), and the generic insertion scheme is to place
$x$ in the first vacant table location in its probe sequence. The search procedure is to traverse the
same  sequence  until  the  item  is  located,  or  an  empty  table  slot  is  identified,  in  which  case  the  item 
is  known  to  be  absent. 
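
As a concrete illustration of the generic insertion and search procedures just described, the following Python sketch treats the probe sequence as a caller-supplied function. The names `insert`, `search`, and `probe`, and the cap on the number of probes, are illustrative assumptions rather than notation from this paper; the cap only guards against probe sequences that never reach a vacant slot.

```python
from typing import Callable, List, Optional

EMPTY = None  # marker for a vacant table slot


def insert(table: List[Optional[int]], x: int,
           probe: Callable[[int, int], int]) -> int:
    """Place x in the first vacant slot of its probe sequence.

    probe(x, j) is a hypothetical callback returning the j-th probe
    location for key x (j = 1, 2, ...). Returns the number of probes used.
    """
    n = len(table)
    # Cap the walk so that a poor probe function cannot loop forever.
    for j in range(1, 10 * n + 1):
        loc = probe(x, j)
        if table[loc] is EMPTY:
            table[loc] = x
            return j
    raise RuntimeError("no vacant slot found within the probe cap")


def search(table: List[Optional[int]], x: int,
           probe: Callable[[int, int], int]) -> bool:
    """Follow the same probe sequence until x or a vacant slot is found."""
    n = len(table)
    for j in range(1, 10 * n + 1):
        loc = probe(x, j)
        if table[loc] is EMPTY:
            return False  # an empty slot proves that x is absent
        if table[loc] == x:
            return True
    return False
```

Any probe function with range $[0, n-1]$ can be plugged in; for instance, `lambda x, j: (x + j - 1) % n` gives linear probing, which is used here only as a simple stand-in.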

Uniform hashing is an idealized model where the probe sequence $p(x, *)$, for each key $x \in U$, is
assumed to be a fully independent random function (or permutation). Traditional double hashing,
which originates in the 1968 Ph.D. thesis of Guy de Balbine [10], defines $p(x,j) = f(x) + (j-1)d(x) \bmod n$,
where the table size $n$ is prime, $f(x)$ is assumed to return an arbitrarily selected
integer in $[0..n-1]$, and $d(x)$ is an arbitrarily selected value in $[1..n-1]$. The $2|U|$ random
values $\{f(x), d(x)\}_{x \in U}$ are assumed to be fully independent and uniformly distributed over their
respective ranges.
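
The double hashing probe sequence can be sketched in a few lines of Python. The sketch assumes the additive form $p(x,j) = f(x) + (j-1)d(x) \bmod n$; the function names and the convention of passing the already-computed values $f(x)$ and $d(x)$ as arguments are illustrative choices, not part of the paper.

```python
from typing import List


def double_hash_probe(f_x: int, d_x: int, j: int, n: int) -> int:
    """Return the j-th probe (j >= 1) for a key with offset f_x and stride d_x.

    Assumes 0 <= f_x <= n-1, 1 <= d_x <= n-1, and a prime table size n.
    """
    return (f_x + (j - 1) * d_x) % n


def probe_sequence(f_x: int, d_x: int, n: int) -> List[int]:
    """Return the first n probes of the key's probe sequence."""
    return [double_hash_probe(f_x, d_x, j, n) for j in range(1, n + 1)]
```

Because $n$ is prime and $d(x) \ne 0$, the first $n$ probes visit every table location exactly once, which is why standard double hashing yields a permutation of $[0, n-1]$.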

In  the  most  common  versions  of  these  hashing  models,  the  probe  sequences  are  used  to  place 
a key in its first vacant probe location, as opposed to some earlier position with the concomitant
relocation of its former occupant. Relocation schemes, as originated by Brent (cf. [1]), have been
designed as a means to reduce the expected search time in applications where search operations
are  much  more  frequent  than  insertions.  Such  schemes,  however,  are  not  the  subject  of  this  study. 
Accordingly,  we  shall  hereafter  refer  solely  to  models  of  closed  hashing  without  rearrangement. 

Contributions begun by Knuth [10], Ullman [18], and Ajtai, Komlós, and Szemerédi [2] culminate
in  a  proof  by  Yao  [20],  who  showed  that  in  terms  of  retrieval  cost,  uniform  hashing  is  optimal:  no 
fixed  set  of  hash  functions  can  perform  better  than  the  random  ones  of  uniform  hashing.  For  double 
hashing,  work  by  Guibas  and  Szemeredi  [8]  and  subsequent  results  by  Lueker  and  Molodowitch  [11] 
culminate in a proof that for random functions $f$ and $d$ and any fixed load factor $\alpha < 1$, the expected
number of probes to insert the $(\alpha n + 1)$-st item is $\frac{1}{1-\alpha}$ plus a vanishing error term, which is asymptotically equivalent
to uniform hashing, and hence optimal.
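
The claim that double hashing with random functions costs about $\frac{1}{1-\alpha}$ probes per insertion can be checked empirically. The following Python sketch is a simulation under stated assumptions: Python's pseudo-random generator stands in for the fully random $f$ and $d$, the table size is a prime chosen for illustration, and the function names are hypothetical.

```python
import random


def insertion_probes(n: int, load: float, rng: random.Random) -> int:
    """Fill a table of prime size n to the given load by double hashing,
    then return the probe count of one further insertion."""
    occupied = [False] * n

    def insert_one() -> int:
        loc = rng.randrange(n)        # f(x): initial location in [0, n-1]
        stride = rng.randrange(1, n)  # d(x): stride in [1, n-1]
        probes = 1
        while occupied[loc]:
            loc = (loc + stride) % n
            probes += 1
        occupied[loc] = True
        return probes

    for _ in range(int(load * n)):
        insert_one()
    return insert_one()


def mean_insertion_probes(n: int, load: float, trials: int, seed: int = 0) -> float:
    """Average the probe count of the final insertion over independent trials."""
    rng = random.Random(seed)
    total = sum(insertion_probes(n, load, rng) for _ in range(trials))
    return total / trials
```

With `n = 1009` and `load = 0.5`, the average over a few hundred trials should land near $\frac{1}{1-0.5} = 2$, up to sampling noise. Such a simulation averages over random hash values only; it says nothing about the limited-independence setting studied in this paper.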

An  important  consequence  of  these  analyses  is  the  certainty  that  only  two  random  functions 
need  to  be  defined  to  provide  an  optimal  hashing  scheme.  Unfortunately,  the  question  of  what 
computable  functions  can  be  proven  to  behave  like  random  hash  functions  has  been  open.  The 
formal  use  of  fully  random  functions  on  U  leaves  our  understanding  of  computable  hashing  in  an 
unsatisfactory  state,  since  such  a  function  has  a  program  size  (Kolmogorov  complexity)  of  about 
$2|U| \log n$ bits, on average. We could use polynomials of degree $n-1$ to implement $n$-wise independent
random functions (cf. Definition 1) to reduce the spatial cost to $O(\log\log m + n \log n)$, while raising
the time needed for evaluating the hash function to $n$, which is no better than the time needed to
search an unordered list.

A  more  constructive  perspective  on  the  traditional  analyses  such  as  [11]  is  that  they  establish 
optimal performance bounds for hash schemes that use programmable functions, provided the measure
of running time is taken to be the performance averaged over all possible input sequences. This
is  not  the  same  as  a  randomized  performance  bound,  where  the  expected  running  time  for  any  fixed 
sequence  of  data  is  shown  to  be  optimal.  The  problem  of  sorting  makes  this  distinction  especially 
clear. Suppose we wish to sort $n$ integers in the range $[0, m-1]$. It is widely believed that no
algorithm, when $m$ is arbitrary, can run in linear time. Yet if integer division of $\log m$-bit words is
taken to be a unit time primitive, then a Binsort of the data into the intervals $[i\frac{m}{n}, (i+1)\frac{m}{n} - 1]$, for
$i = 0, 1, 2, \ldots$, followed by a sorting of each partition will run in linear time on the average, because
the  Binsort  partitions  the  data  into  pieces  exhibiting  an  average  size  of  1  and  a  variance  slightly 
below  1.  There  are,  of  course,  many  data  sets  where  the  partitioning  is  useless. 
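
The Binsort example can be made concrete with a short Python sketch. It assumes that the $n$ input integers are split into $n$ intervals of width about $m/n$, which is one reading of the interval bounds above; the function name `binsort` and the use of the built-in `sorted` for each bucket are illustrative choices.

```python
from typing import List


def binsort(values: List[int], m: int) -> List[int]:
    """Sort integers drawn from [0, m-1] using n = len(values) buckets."""
    n = len(values)
    if n == 0:
        return []
    buckets: List[List[int]] = [[] for _ in range(n)]
    for v in values:
        # Bucket i receives the values v with i*m/n <= v < (i+1)*m/n.
        buckets[v * n // m].append(v)
    out: List[int] = []
    for bucket in buckets:
        out.extend(sorted(bucket))  # each bucket is small on average
    return out
```

On uniformly random input the buckets have average size 1, so the total work is linear on average. An input such as `[3, 1, 2, 0]` with `m = 1000` lands entirely in bucket 0, which illustrates a data set where the partitioning is useless.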

Carter and Wegman [6], [19] contributed to our understanding of limited randomness and
randomized  performance  by  introducing  the  notion  of  universal  hash  functions,  and  showing  that 
these  functions,  when  used  for  open  hashing  with  separate  chaining,  result  in  an  expected  probe 
performance  that  is  equivalent  to  the  fully  random  formulation.  In  this  model,  L[i]  comprises  a 
linked list of items hashing to the value $i$; external storage is used to hold colliding items and
pointer  information  linking  them  together. 

In particular, Carter and Wegman exhibited the universal classes of uniformly distributed $h$-wise
independent hash functions (which they called universal$_h$):

$$F_{h,n} = \Big\{ f \;\Big|\; f(x) = \Big(\sum_{0 \le j < h} a_j x^j \bmod p\Big) \bmod n;\ a_j \in [0, p-1] \Big\}, \qquad (1)$$

where $p > m$ is prime. They showed that if, for any $D \subset U$, a hash function is randomly selected
from $F_{h,n}$ (independent of $D$), then the sum of the expected $j$-th moments of the chain (i.e., list)
lengths is essentially the same as that resulting from fully random functions, for $j < h$. For separate
chaining,  the  second  moment  determines  the  expected  retrieval  time,  whence  pairwise  independence 
guarantees  optimal  expected  performance. 
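
A member of the polynomial class in equation (1) is determined by $h$ coefficients drawn uniformly from $[0, p-1]$. The Python sketch below samples such coefficients and evaluates the resulting function with Horner's rule; the helper names are hypothetical, and the small prime used in the usage note is for illustration only.

```python
import random
from typing import List


def sample_coeffs(h: int, p: int, rng: random.Random) -> List[int]:
    """Draw coefficients a_0, ..., a_{h-1} uniformly from [0, p-1]."""
    return [rng.randrange(p) for _ in range(h)]


def eval_poly_hash(coeffs: List[int], x: int, p: int, n: int) -> int:
    """Evaluate ((sum_j a_j * x**j) mod p) mod n by Horner's rule."""
    acc = 0
    for a in reversed(coeffs):  # highest-degree coefficient first
        acc = (acc * x + a) % p
    return acc % n
```

For example, with coefficients `[2, 3, 1]`, `x = 4`, `p = 17`, and `n = 5`, the polynomial value is $2 + 3 \cdot 4 + 4^2 = 30$, which reduces to $13$ modulo $17$ and then to $3$ modulo $5$. Evaluation takes time proportional to $h$, which is the cost that motivates the faster families discussed next.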

Carter  and  Wegman  also  posed  as  an  open  question  whether  a  comparable  result  could  be 
achieved for any form of closed hashing, such as double hashing. We resolve the question affirmatively
if, for a sufficiently large $c$, $c \log n$-wise independent hash functions are used.

For  our  purposes,  however,  the  evaluation  time  of  their  universal  hash  functions  is  too  high,  but 
the  results  of  [16]  exhibit  such  families  with  constant  evaluation  time  for  a  standard  word  model 
Random  Access  Machine.  (The  model  in  [16]  follows  the  standard  conventions  of  allowing  indexed 
access  to  a  size  n  array  in  constant  time.  A  constant  number  of  multiplications  and  integer  divisions 
are deemed to require $O(1)$ time for keys in the universe $U$, but most of the (constant number of)
operations are on $O(\log n)$-bit words. The requisite number of random bits still turns out to be
$O(\log\log m + \log n)$,† although the $O(\log n)$ dependence is increased by a constant factor, and
auxiliary storage of $n^\epsilon$ $(\log n)$-bit words is provably necessary, for some $\epsilon < 1$.)

The  attraction  of  a  universal  hashing  formulation  is  two-fold.  First,  the  notion  provides  a  way 
to  construct  randomized  algorithms  for  hashing,  which  eliminates  any  requirement  that  the  data 
be  “random,”  for  suitable  performance.  Second,  it  allows  a  fixed  -  presumably  computable  -  set  of 
functions  to  be  used  for  hashing,  as  opposed  to  an  axiomatically  specified  “random”  function. 

This work differs considerably from the analyses of [8] and [11] in that both analyze the intersection
patterns of arithmetic progressions, whereas this work has no notion of such a sequence.
Byproducts  of  such  a  proof  formulation  include  the  following. 

1)  Automatic  generalization  of  our  performance  results  to  arbitrary  hashing  schemes  that  satisfy 
(minor)  requirements  regarding  adequate  table  coverage  and  that  (more  importantly)  exhibit 
approximate  pairwise  independent  probing,  which  can  be  formalized  as  follows. 

$\forall x\ \forall k \ne i\ \forall \ell, j:\ \mathrm{Prob}\{p(x,k) = \ell,\ p(x,i) = j\} \le \frac{1}{(n - O(1))^2}$.

Actually, the probe sequence can be defined by the equation $p(x,j) = h(f(x), d(x), j)$, for
random $f$ and $d$, and any deterministic function $h$ as long as pairwise independence is assured
for the first $O(\log n)$ probes, and reasonable coverage occurs for later probes. The coverage
requirements,  as  quantified  later,  simply  ensure  that  overall  probe  coverage  is  adequate  to 
guarantee  that  the  insertion  of  a  key  x  will  fail  only  when  the  table  is  full,  and  that  within  any 
x’s  probe  sequence,  locations  will  not  be  repeated  enough  to  degrade  the  resulting  performance. 

2)  A  proof  that  O(logn)-wise  independent  hash  functions  are  random  enough  to  preserve  the 
expected performance of the hashing schemes in 1), up to a fixed error of $\epsilon$.

The  rest  of  this  section  is  organized  as  follows.  Subsection  1.1  explains  why  a  two-level  hashing 
scheme  can  enable  the  use  of  functions  with  a  spatial  complexity  of  only  O(log  log  m+log  n)  random 
bits. Subsection 1.2 formalizes a notion of limited independence with requirements that, in most
respects, are slightly stronger than the definitions generally encountered in the literature (and in
other respects slightly weaker) but are still readily achievable. It also presents constructions of
two-level hash functions that exhibit the statistical randomness required by our analyses. Subsection
1.3 outlines the rest of the paper.

† We use the Big-Oh notation in the following standard way: $f = g + O(h)$ means that $|f - g| = O(|h|)$.
Consequently, there is no distinction between $f + O(g)$ and $f - O(g)$. Nevertheless, we shall, upon occasion,
use minus signs to suggest that the worst case error is negative. Also, it is quite reasonable to write, say,
$\frac{1}{1 + O(h)} = 1 - O(h)$, for $h = o(1)$.

1.1  Reducing  the  domain 

Although  our  proofs  show  that  any  set  of  sufficiently  well  behaved  hash  functions  can  be  used  for 
double  hashing,  it  is  worth  noting  that  a  universal  class  of  linear  congruential  hash  functions  can 
be used to map the data $D$ into a polynomial sized space such as $[0, n^4]$ in a collision-free manner,
with high probability. Then the universal class $F_{h,n}$ (as defined in Section 1.0) can be restricted
to have coefficients of size $O(n^4)$, as opposed to size $O(m)$, which might be much larger. Such
mappings  can  be  pieced  together  from  techniques  in  [6],  [12]  and  [7].  Accordingly,  we  first  exploit 
the  following  variation  of  Lemma  2  from  [7]: 

Fact 1: Let $P_k = \{p \mid p \text{ is prime and } p \in (n^k \log m,\ (2+\delta) n^k \log m)\}$, for some suitably small
fixed $\delta > 0$. Then

$$\forall x \ne y \in D:\ \mathrm{Prob}_{p \in P_k}\{x \equiv y \bmod p\} \le n^{-k}.$$

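Proof: [12], [7]. By the Prime Number Theorem, $|P_k| = \frac{(1+\delta)\, n^k \log m}{\ln(n^k \log m)}(1 - o(1))$. The product
of any $\gamma|P_k|$ primes in $P_k$ is bounded below by $(n^k \log m)^{\gamma|P_k|} > m^{\gamma n^k}$, whence no more than
a fraction $\gamma \le 1/n^k$ of the elements of $P_k$ can divide $|x - y|$. ∎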


Fact 2: Let $F_0(p) = \{h \mid h(x) = ((ax + b) \bmod p) \bmod n^k,\ a \ne 0,\ b \in [0, p-1]\}$, where $p > n^k$ is
prime. Then

$$\forall x \ne y \in [0, p-1]:\ \mathrm{Prob}_{f \in F_0(p)}\{f(x) = f(y)\} \le n^{-k}.$$

Proof: [6]. Given $x$ and $y$, $x, y \in [0, p-1]$, $x \ne y$, the number of different $f \in F_0(p)$ where
$f(x) = f(y)$ is precisely the number of $2 \times 2$ linear systems in $a$ and $b$:

$$\begin{cases} ax + b = c + d n^k \bmod p, \\ ay + b = c + e n^k \bmod p, \end{cases}$$

where $c, d, e \ge 0$; $c + dn^k < p$; $c < n^k$; $e \ne d$; $c + en^k < p$.

Now $c + dn^k$ can have $p$ different values. The remaining parameter $e$ cannot be set to $d$ because
this would give $a = 0$. Thus there are at most $\lceil p/n^k - 1 \rceil$ different values available for $e$. Since there
are exactly $p(p-1)$ different functions in $F_0(p)$, and the number of $f$ where $f(x) = f(y)$ is at most
$p\lceil p/n^k - 1 \rceil = p\lfloor p/n^k \rfloor \le \frac{p(p-1)}{n^k}$, the result follows. ∎

Combining Facts 1 and 2 shows that a hash function selected at random from $F_0 = \bigcup_{p \in P_k} F_0(p)$
will, with probability exceeding $1 - 2\binom{\alpha n}{2} n^{-k}$, map $D$ into $[0, n^k - 1]$ with no collisions at all among
its $\binom{\alpha n}{2}$ pairs. We may take $k = 4$, so that the probability of a collision is below $1/n^2$, and assume
the functions $F_{h,n}$ are defined for $p \approx n^4$.

Because of this preprocessing, the spatial complexity of our composite universal hash functions
$F_{h,n} \circ F_0$ is $O(\log\log m + \log^2 n)$ bits, for $h = O(\log n)$.
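
The domain-reduction step of Facts 1 and 2 can be sketched as follows. This is a minimal Python illustration under stated assumptions: logarithms are taken base 2 and rounded up, $\delta = 0.5$ is an arbitrary choice, the prime is found by rejection sampling with a trial-division test, and all function names are hypothetical.

```python
import math
import random


def is_prime(q: int) -> bool:
    """Trial-division primality test; adequate for the small moduli used here."""
    if q < 2:
        return False
    for r in range(2, math.isqrt(q) + 1):
        if q % r == 0:
            return False
    return True


def sample_premap(n: int, m: int, k: int, rng: random.Random, delta: float = 0.5):
    """Draw g(x) = ((a*x + b) mod p) mod n**k for a random prime p.

    p is uniform over the primes strictly between lo and hi, where
    lo = n**k * ceil(log2(m)) and hi = int((2 + delta) * lo).
    Returns the function g together with the chosen prime p.
    """
    lo = n ** k * math.ceil(math.log2(m))
    hi = int((2 + delta) * lo)
    while True:  # rejection sampling over the integers in (lo, hi)
        p = rng.randrange(lo + 1, hi)
        if is_prime(p):
            break
    a = rng.randrange(1, p)  # a != 0
    b = rng.randrange(p)
    size = n ** k
    return (lambda x: ((a * x + b) % p) % size), p


def is_collision_free(keys, g) -> bool:
    """Check whether g maps the given distinct keys to distinct values."""
    images = [g(x) for x in keys]
    return len(set(images)) == len(images)
```

A caller would draw `g` once, verify `is_collision_free(D, g)`, and then apply a function from $F_{h,n}$ to the reduced keys. The key point of the construction is that `g` is described by three numbers of $O(\log\log m + \log n)$ bits each, independent of how large the original universe is.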

1.2  Limited  randomness 

Since  the  randomness  of  our  hash  function  family  restricts  the  size  of  the  small  data  sets  where  the 
hashing  behavior  is  easy  to  analyze,  it  is  convenient  to  formalize  this  family  characteristic. 

Carter and Wegman defined a family of hash functions $F$ with domain $U$ and range $R$ to be
strongly universal$_h$ if

$$\forall y_1, \ldots, y_h \in R,\ \forall \text{ distinct } x_1, \ldots, x_h \in U:\quad |\{f \in F : f(x_i) = y_i,\ i = 1, 2, \ldots, h\}| = \frac{|F|}{|R|^h},$$


so that the fraction of functions in $F$ that achieve the desired mapping of the $x_i$'s is the same
as that for fully random functions. This definition combines the requirements of uniformity and
$h$-wise independence. The specification is a little stronger than that used by Carter and Wegman
for open hashing, and was introduced by them for application in cryptography [19]. They also
gave an application of almost universal$_2$ functions where the function density $\frac{|F|}{|R|^2}$ is multiplied by
a constant factor and used as an upper bound.

Our  bounds  for  closed  hashing  are  so  dependent  upon  inclusion-exclusion  that  we  need  a  very 
precise notion of almost universal, which separates the uniformity and independence requirements
and  which  is  formalized  as  follows. 


Definition  1. 

We say that a set of functions $F$ with domain $U$ and range $R$ is an $h$-wise independent universal
family of hash functions with $\beta$-tolerance if $F$ exhibits
$(h)$-wise independence: $\forall y_1, \ldots, y_h \in R$, $\forall$ distinct $x_1, \ldots, x_h \in U$:

$$\frac{|\{f \in F : f(x_i) = y_i,\ i = 1, 2, \ldots, h\}|}{|F|} = \prod_{i=1}^{h} \frac{|\{f \in F : f(x_i) = y_i\}|}{|F|}$$

and near uniformity:

$$\forall y \in R,\ \forall x \in U:\ (1-\beta)\frac{|F|}{|R|} \le |\{f \in F : f(x) = y\}| \le (1+\beta)\frac{|F|}{|R|}.$$

Thus the family of hash functions has a nearly uniform distribution, and the joint probability
distribution on any subset comprising $h$ or fewer points in $U$ exhibits the usual multiplicative
independence. It is worth observing that the function classes $F_{h,n}$ from Section 1.0 are $(h)$-wise
independent with $1/n^3$-tolerance for a universe of size $n^4$.
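
The two requirements of Definition 1, multiplicative independence and near uniformity, can be verified exhaustively for a toy instance of the polynomial classes. The Python sketch below enumerates every function $x \mapsto ((\sum_j a_j x^j) \bmod p) \bmod n$ for small, illustrative parameters and uses exact rational arithmetic; the function names and the choice of universe $[0, p-1]$ are assumptions made for the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product


def evaluate(coeffs, x, p, n):
    """Evaluate ((sum_j a_j * x**j) mod p) mod n by Horner's rule."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % p
    return acc % n


def uniformity_tolerance(p, n, h):
    """Smallest beta with (1-beta)|F|/|R| <= #{f : f(x) = y} <= (1+beta)|F|/|R|
    for every x in [0, p-1] and y in [0, n-1]."""
    fam = list(product(range(p), repeat=h))
    ideal = Fraction(len(fam), n)
    worst = Fraction(0)
    for x in range(p):
        counts = Counter(evaluate(c, x, p, n) for c in fam)
        for y in range(n):
            worst = max(worst, abs(counts[y] - ideal) / ideal)
    return worst


def is_pairwise_independent(p, n):
    """Check the product rule for h = 2 over all pairs of distinct points."""
    fam = list(product(range(p), repeat=2))
    size = len(fam)
    for x1, x2 in combinations(range(p), 2):
        joint = Counter((evaluate(c, x1, p, n), evaluate(c, x2, p, n)) for c in fam)
        m1 = Counter(evaluate(c, x1, p, n) for c in fam)
        m2 = Counter(evaluate(c, x2, p, n) for c in fam)
        for y1, y2 in product(range(n), repeat=2):
            lhs = Fraction(joint[(y1, y2)], size)
            if lhs != Fraction(m1[y1], size) * Fraction(m2[y2], size):
                return False
    return True
```

For `p = 11` and `n = 3` the product rule holds exactly, while the uniformity deviation is bounded by $n/p$; the deviation arises only from the final reduction modulo $n$, which is why a larger prime $p$ yields a smaller tolerance.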

On the other hand, our premapping step for larger universes will not quite meet the multiplicative
requirement of Definition 1 because $F_0$ will have too many hash functions that map a sequence
of hash keys $D$ into $[0, n^4]$ with collisions. If we ignore such unfortunate cases, and charge, say, an
$O(n)$ cost per insertion for such instances, then our performance bounds will not change, and we
may  rely  on  the  constructions  of  Section  2  to  perform  well  enough  in  general.  Accordingly,  our  final 
randomness  characterization  is  as  follows. 

Definition  2. 

We  say  that  a  family  of  hash  functions  F  with  domain  U  and  range  R  is  effectively  (h)-wise 
independent  with  8-tolerance  if  for  each  D  c  U  with  \D\  <  n,  3 F  c  F  where  >  1  and  F  is 

(h)^-wise  independent  with  tolerance-/?  for  domain  D  and  range  R. 

We  shall  take  the  requirement  of  uniform  distribution  with  ,3-tolerance  to  be  understood,  and 
simply  refer  to  these  schemes  in  terms  of  their  limited  independence.  Section  1.1  gives  a  formal 
construction  where  for  any  fixed  set  D  c  U  of  n  input  keys,  all  but  1/n2  of  the  Fq  map  D  into  [0,  n4] 
in  a  collision  free  way,  and  the  subsequent  hashing  is  fully  (h)-wise  independent  with  tolerance 
p-.  Evidently,  this  family  is  effectively  (h)-wise  independent  with  tolerance- p  according  to  the 
requirements  of  Definition  2. 

In  our  formal  models,  a  family  of  hash  functions  H  comprises  a  finite  set  of  functions.  Given 
the  data  sequence  D ,  a  specific  hash  function  is  selected  by  randomly  choosing  a  function  from  H 
with each element equally likely to be selected. The statistical properties defined by Definitions 1
and 2, as well as those which follow in Definition 4, are with respect to $H$.

1.3  A  proof  outline 

Traditional  analyses  of  hashing  view  the  state  of  a  hash  table  as  a  stochastic  process  that  evolves 
over a duration of $\alpha n$ probabilistic insertions. Lueker and Molodowitch [11], for example, analyze
double  hashing  in  the  fully  random  case  with  an  elegant  scheme  that  keeps  the  table  distribution 
uniform  by  introducing  moderately  improbable  randomizing  insertions  of  fake  items  to  correct  the 
distribution  at  each  insertion  step.  By  vigilantly  maintaining  a  fully  random  table  distribution, 
they  establish  a  simple  proof  that  double  hashing  and  uniform  hashing  exhibit  comparable  collision 
statistics  in  a  fairly  strong  sense,  and  this  intuition  has  turned  out  to  be  invaluable  in  this  current 
work,  which  establishes  an  even  closer  statistical  equivalence.  Unfortunately,  such  an  evolutionary 
approach  seems  to  be  inappropriate  for  instances  where  the  randomness  is  limited,  since  all  of  the 
randomness would be used up after $\log n$ insertions. Instead, we are obliged to establish the bounds
with  a  proof  technique  that  can  be  extended  from  uniform  hashing  to  double  hashing  with  full 
independence  to  a  comparable  double  hashing  with  limited  independence.  The  hashing  models  are 
fully  specified  in  Definition  4. 

Let  a  fixed  hashing  model  complete  with  (probabilistically  selected)  hash  functions  be  specified, 
and consider a key $x \in D$. We may define its dependency set $dep(x, D)$ (Definitions 5, 6, and 7) to
comprise  x  and  the  recursively  defined  members  of  the  dependency  sets  of  the  keys  that  occupy  the 
table  locations  probed  during  the  insertion  of  x. 

Given a subsequence $S \subset D$, one may ask, what is the probability that the specific items in $S$
are the precise and full cause for the number of probes needed to insert $x$? A necessary condition
for $S$ is that $dep(x, S) = S$ (cf. Definition 8 as applied to $P(k,k)$ for $k = |S|$, and Section 2.1.1).
The  probability  that  the  probe  sequences  for  S  have  this  behavior  turns  out  to  be,  up  to  a  factor 
of  (1  +  0(1#)),  the  same  in  uniform  hashing  and  our  generalized  double  hashing  schemes,  as  long 
as $3|S|$ does not exceed the amount of independence of our hash functions (Lemma 2).

Unfortunately,  there  can  be  many  subsequences  in  D  that,  in  the  absence  of  other  competing 
sequences, would satisfy the collision conditions for the set $S$. We may characterize $dep(x, D)$ as
that special $S$ where $dep(x, S) = S$ and each $y \in S$ turns out to encounter no $z \in D - S$ residing in
its probe locations when it is inserted as a member of the full sequence $D$: its probe locations must
only contain elements from $S$ (Lemma 3). The probability that each $y$ in $S$ satisfies this latter
criterion when all of $D$ is hashed is more difficult to estimate. We define the formal notion of a
multiplicative vacancy estimator $q(t)$ (Definition 14) that gives an overestimate of the probability
that a given location will be empty when $x_t$ is hashed as the $t$-th element in $D$.

Then an explicit overestimate for the expected number of probes to insert $x_{\alpha n}$ can actually be
calculated  for  uniform  hashing  and  any  multiplicative  vacancy  overestimator  q  (Theorem  1). 

The  most  complicated  calculation  for  double  hashing  is  to  estimate  the  probability  that 
an  arbitrarily  specified  sequence  of  probe  locations  I  c  [1, 2,  —  1]  satisfies  the  claim 

Vj  <  |T|  :  ( L[Ij\  is  vacant  prior  to  the  insertion  of  x-y.)  (Lemma  6  and  Theorem  3).  We  define  a 
quantifiable notion of weak vacancy estimator (Definition 15) where location $\ell$ is “vacant” at time $t$
if no subsequence of $h$ or fewer items in $x_1, x_2, \ldots, x_{t-1}$ hash into a local dependency set that embeds
a key in $L[\ell]$. Then we formalize the notion of a witness set (Definition 16), which will comprise (a
maximal) subset of $D - S$ that includes (among other keys) all subsets that will (or might) cause the
vacancy condition to be false. The probability that our weakened vacancy criterion holds can then
be estimated by a summation (equation (22)) over all subsets of $D - S$ of the probability that each
subset is a witness set that does not contradict the vacancy statement, and this summation could,
in principle, be summed (in equation (2)) to give an expression that overestimates the insertion cost
for $x_{\alpha n}$. Rather than evaluate such a hopelessly complicated summation, we show that the sum is
asymptotically the same for uniform hashing and double hashing with full independence (Lemma
6).

Inclusion-exclusion  is  used  to  extend  the  result  to  double  hashing  with  limited  independence. 
We  also  have  to  establish  a  bound  that  guarantees  that  the  witness  set  has  a  size  that  is  proportional 
to $\log n$, with overwhelming probability (Lemma 7).

Lastly,  Theorem  3  shows  that  for  uniform  hashing,  the  explicit  vacancy  estimate  given  for  q 
satisfies the multiplicativity criterion used for our estimate in Theorem 1, which therefore holds,
and  provides  an  evaluation  of  our  more  complicated  performance  summation  under  all  models. 

2.0  Generic  probe  counts 

Since  our  probe  formulations  are  based  on  graphs  that  capture  all  essential  collision  behavior,  a  few 
preliminary  definitions  would  seem  to  be  appropriate. 

2.1  Basic  definitions 

Definition  3. 

• The hash keys $D = (x_1, x_2, \ldots, x_{\alpha n})$ comprise a sequence of $\alpha n$ distinct items, $\alpha < 1$, belonging
to the universe $U = \{0, 1, \ldots, m-1\}$, and $p(x,j): U \mapsto [0, n-1]$ denotes the $j$-th probe for key $x$.

• The $i$-th element in a sequence $S$ is denoted by $S_i$. For $D$, we also have $D_t = x_t$.

•  The  random  variable  giving  the  number  of  probes  needed  to  insert  aq  is  defined  to  be  probe 
The  randomness  is  due  to  the  randomness  in  the  hash  functions  as  opposed  to  the  data. 

• A rooted DAG is a directed acyclic graph with only one root (i.e., one vertex with indegree 0).

•  Let  dgr(G)  of  a  rooted  DAG  G  be  the  outdegree  of  the  root  of  G. 

• Let $x$ is embedded at location $\ell$ mean that as a consequence of hashing $D$ into table $L$, $L[\ell] = x$.
We extend this notion to include cases where only a subsequence $S \subset D$ of the data is hashed,
in  which  case  D  should  be  replaced  by  S,  and  the  embedding  assignment  should  be  understood 
to  be  possibly  incorrect,  when  all  of  D  is  processed. 

We will be analyzing how $D$ is hashed into a table $L$ of size $n$ under three models: uniform hashing,
a generalization of double hashing where random hash functions are used, and the same double
hashing model where the hash functions are constructed from a family of $(\psi)$-wise independent
universal hash functions.

Definition 4: The models UH, DH, and DH$_\psi$.

• In UH, the probe sequence $p(x, *)$ is an independent family of random variables that are
uniformly distributed over $[0, n-1]$. Any collection of sequences $p(x_1, *), \ldots, p(x_n, *)$
are mutually independent, for distinct $x_i$.

•  DH  relaxes  the  requirement  that  each  individual  probe  sequence  be  fully  random. 

1. Each probe sequence $p(x, *)$ exhibits approximate pairwise independence:

$\forall x\ \forall i, j,\ i \ne j,\ \forall r, s \in [0, n-1],\ r \ne s$: $\mathrm{Prob}\{p(x,i) = r,\ p(x,j) = s\} = \frac{1}{(n - O(1))^2}$.

2.  Furthermore,  the  random  sequences  {p(x,  *)}xe£i  are  mutually  independent.  This  condition 
need  only  hold  for  a  subset  of  hash  functions  F  c  F,  where  F  depends  on  D,  and  > 

1_2Q1 

3.  In  addition,  we  have  the  following  robustness  requirements. 

i)  Extremely  long  probe  sequences  are  quite  rare:  For  a  fixed  c0  that  depends  on  a, 

\fx  :  ^2  Prob{\  U*=1  {p(x,  *)}|  <  an  +  1}  <  ^p-. 

t>c0n 

ii)  Probe  sequences  are  unlikely  to  reprobe  locations  too  frequently. 

Vx  yj  <  k  <  h.  r  e  [0,  n  -  1]  :  Prob{p(x,j )  =  p(x,  k),  p{x1  h)  —  r}  —  pjp- 

yx  yh  <  i..j  <  k ,  (h,  i )  ^  (j,  k)  :  Prob{p(x,  h)  =  p{x,  i ),  p{x,j)  =  p(x,  k)}  =  ^r1. 

• In DH$_\psi$, the statistical probe behavior of an individual probe sequence is subject to the same
requirements as in DH for the first $\psi$ probes, the global coverage requirement must still hold,
and the joint distribution of initial probe sequences, for collections of $\psi$ or fewer probes, is
required to be statistically independent, for distinct items. More precisely, we have the following.

1.  Vx  yij  <  ip  i  F  j  Vr.  *  e  [0,  n-l\r£s:  Prob{p(x ,  *)  =  r,  p(x.  j)  =  5}  =  (k_01(1))2  ■ 

2. Knowing a limited number of the probe values for a small set of keys gives no information about the first few probes for another key. Formally, let $Z$ be a set of keys $\zeta \in U$ with associated probe count bounds $j_\zeta$, where $\sum_{\zeta \in Z} j_\zeta \le \psi$. Let $\{\ell_{\zeta,i}\}_{\zeta \in Z,\ i \le j_\zeta}$ be a multiset of arbitrary probe locations. Then

\[ \mathrm{Prob}\Bigl\{ \bigwedge_{\zeta \in Z} \bigwedge_{i \le j_\zeta} p(\zeta,i) = \ell_{\zeta,i} \Bigr\} = \prod_{\zeta \in Z} \mathrm{Prob}\Bigl\{ \bigwedge_{i \le j_\zeta} p(\zeta,i) = \ell_{\zeta,i} \Bigr\}. \]


This  condition  need  only  hold  for  a  subset  of  hash  functions  F  c  P,  where  F  depends  on 
D,  and  jjj  >  1  - 

3.  For  some  fixed  cq  that  depends  on  a,  Vx  :  J2t>c0n  Prob{\  U*=1  (p(x\  *)}|  <  an  +  1}  <  T. 

4. $\forall x\ \forall j < k < h \le \psi,\ r \in [0,n-1]:\quad \mathrm{Prob}\{p(x,j) = p(x,k),\ p(x,h) = r\} = O(1/n^2)$.

5. $\forall x\ \forall h < i \le \psi,\ j < k \le \psi,\ (h,i) \ne (j,k):\quad \mathrm{Prob}\{p(x,h) = p(x,i),\ p(x,j) = p(x,k)\} = O(1/n^2)$.

The requirement that $r \ne s$ is explicitly included in condition 1 of $DH$ to ensure that standard double hashing, which guarantees that $p(x,*)$ is a permutation, belongs to $DH$. Similarly, double hashing and all of $DH$ are included in $DH_\psi$.
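
As a sanity check on the pairwise condition, the following sketch enumerates every seed pair for standard double hashing on a small table and counts how many of them produce a prescribed pair of probe values. It assumes the usual probe function $p(x,j) = f(x) + (j-1)d(x) \bmod n$ with $n$ prime and $d(x) \in [1, n-1]$, which this section does not spell out; the function names are illustrative only.

```python
from itertools import product


def probe(f_x, d_x, j, n):
    """j-th probe (1-based) of standard double hashing (assumed form)."""
    return (f_x + (j - 1) * d_x) % n


def count_seed_pairs(n, i, j, r, s):
    """Count pairs (f_x, d_x) whose i-th probe is r and whose j-th probe is s."""
    return sum(
        1
        for f_x, d_x in product(range(n), range(1, n))
        if probe(f_x, d_x, i, n) == r and probe(f_x, d_x, j, n) == s
    )


def is_permutation(f_x, d_x, n):
    """True if the first n probes visit every table location exactly once."""
    return sorted(probe(f_x, d_x, j, n) for j in range(1, n + 1)) == list(range(n))


n = 7  # small prime table size, chosen only for the demonstration
hits = count_seed_pairs(n, i=1, j=3, r=2, s=5)
print(hits, "of", n * (n - 1), "seed pairs give p(x,1)=2 and p(x,3)=5")
print(count_seed_pairs(n, i=1, j=3, r=4, s=4), "seed pairs give r = s")
```

Under these assumptions each admissible pair $(r,s)$ with $r \ne s$ is produced by exactly one of the $n(n-1)$ seed pairs, and no seed pair produces $r = s$, which is why the case $r = s$ has to be excluded from the condition.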

Our robustness requirements replace the (stronger) requirement that probe sequences be permutations. It is easily seen that some form of robustness is necessary to guarantee that hash functions do not fail. Consider the damage that would occur if the offset function $d(x)$, for standard double hashing, were allowed, for example, to be zero even with the tiny probability $1/n^3$: with probability $1/n^3$ the number of probes needed to insert $x_i$ becomes $\infty$, and so does the expected probe count. Such degenerate functions must therefore be excluded from $DH$ and $DH_\psi$.

Since we are using finite classes of hash functions, a single defective function can place an otherwise efficient algorithm outside $DH$ or $DH_\psi$. On the other hand, we may include such classes in $DH_\psi$ by modifying the hashing procedure when, say, an item's first $n^{1/3}$ probes (or $O(n)$) have failed to find a vacant location. A suitable strategy would be to switch to linear probing (where $p(x,j) = f(x) - j + 1 \bmod n$, for a random $f$), which would reduce the probability of failure to zero and satisfy our global robustness requirement. We could even set $f = 0$ in this case. Alternatively, one could select new random seeds and rehash the entire data set.
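
The fallback strategy can be sketched as follows. This is a minimal illustration, not the authors' procedure: the double hashing probe function $f(x) + (j-1)d(x) \bmod n$, the parameter name `limit`, and the helper names are assumptions; only the linear probing formula is taken from the text.

```python
def double_hash_probe(f_x, d_x, j, n):
    """j-th probe (1-based) of standard double hashing (assumed form)."""
    return (f_x + (j - 1) * d_x) % n


def linear_probe(f_x, j, n):
    """Linear probing as given in the text: p(x, j) = f(x) - j + 1 mod n."""
    return (f_x - j + 1) % n


def insert_with_fallback(table, key, f_x, d_x, limit):
    """Insert key and return the number of probes used.

    The first `limit` probes follow double hashing; if all of them fail,
    the procedure switches to linear probing, which visits every location.
    """
    n = len(table)
    for j in range(1, limit + 1):
        loc = double_hash_probe(f_x, d_x, j, n)
        if table[loc] is None:
            table[loc] = key
            return j
    for j in range(1, n + 1):
        loc = linear_probe(f_x, j, n)
        if table[loc] is None:
            table[loc] = key
            return limit + j
    raise RuntimeError("table is full")


# Degenerate offset d(x) = 0: double hashing alone would reprobe slot 3 forever.
table = [None] * 7
table[3] = "a"
probes = insert_with_fallback(table, "b", f_x=3, d_x=0, limit=2)
print(probes, table)
```

With the degenerate offset, the two double hashing probes both hit the occupied slot, and the linear scan then places the key after a bounded number of further probes, so the insertion cannot fail while a vacant location exists.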

The independence $\psi$ needed for these proofs will turn out to be $O(\log n)$. An immediate consequence of this work is that standard double hashing will achieve near optimal performance if, for example, the probe and offset functions $f, d$ are chosen from an effectively $\psi$-wise independent
family  of  hash  functions  with  tolerance  dg-.  For  example,  an  adequately  independent  family  is 
given  by  /  e  o  g:  and  (d  -  1)  e  $&  n-\  °  9->  where  g  is  a  random  function  in  Fq.  Subject 
to  the  caveats  needed  to  ensure  the  absence  of  failure,  uniform  hashing  will  achieve  near  optimal 
expected  performance  for  the  probe  functions  p(x,j )  =  f(j  +  ng(x)),  for  g  e  Ffi ,  and  f  e  F^,  n,  where 
F^,n  ■  [0,-nfc+1]  ^  [0,r-1],  for,  say  k  —  4  as  in  Section  1.1.  The  function  families  iG  n.  F ^  n_1:  and 
F^,  n  could  be  the  universal  hash  functions  presented  in  Section  1.0,  or  the  constant  time  functions 
of  [16]. 

We have already seen that the presence of irregular hash functions, such as the small $1/n^2$ fraction in $F_0$ that have collisions on $D$, is insignificant. We now drop all reference to them, since they induce a probe cost of $O(1/n)$.

To  analyze  how  the  ordered  data  set  D  hashes  into  the  table,  we  introduce  a  family  of  directed 
graphs  to  capture  the  structure  of  the  collision  events. 

Definition  5:  The  dependency  graph  G(D). 

Given a sequence $D$ of hash keys, we say that a hashing of $D$ defines a directed dependency graph $G(D)$ as follows. The vertex set of $G(D)$ is $D$ and the edge set is initially empty. Suppose that when inserting $x$ into the hash table, $x$ is placed in its $k$-th probe location $\ell_k$ (after probing $\ell_1, \ldots, \ell_{k-1}$). We add a directed edge from $x$ to each of the items $x_{i_1}, \ldots, x_{i_{k-1}}$ residing in table locations $\ell_1, \ldots, \ell_{k-1}$. Each edge is labeled with its corresponding probe number, and each vertex $x_\tau \in D$ bears its label $\tau$, which is its position in the sequence $D$.

Notice  that  G(D ),  despite  extensive  labeling,  bears  no  information  to  indicate  where  nodes  are 
embedded. 
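
The construction of $G(D)$ can be sketched as a simulation of the insertions. The representation below (a dictionary of labeled out-edges, a caller-supplied `probe(x, j)` function, and a `max_probes` guard) is an illustrative assumption. The second helper collects the vertices reachable from $x$, which Definition 6 names $dep(x,D)$.

```python
def build_dependency_graph(keys, probe, n, max_probes=None):
    """Hash keys in order into a table of size n.

    Returns (location_of, edges), where edges[x] lists one pair
    (probe_number, occupant) for every occupied location that x probed
    before it found a vacant one.
    """
    limit = max_probes if max_probes is not None else 10 * n
    table = [None] * n
    location_of, edges = {}, {}
    for x in keys:
        edges[x] = []
        for j in range(1, limit + 1):
            loc = probe(x, j)
            if table[loc] is None:
                table[loc] = x
                location_of[x] = loc
                break
            edges[x].append((j, table[loc]))
        else:
            raise RuntimeError(f"no vacant location found for {x!r}")
    return location_of, edges


def dependency_set(x, edges):
    """Vertices reachable from x in the dependency graph, including x."""
    seen, stack = {x}, [x]
    while stack:
        v = stack.pop()
        for _, w in edges[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


# Explicit probe sequences for a toy example (hypothetical data).
seqs = {"a": [0], "b": [0, 1], "c": [0, 1, 2], "d": [4]}
location_of, edges = build_dependency_graph(
    ["a", "b", "c", "d"], lambda x, j: seqs[x][j - 1], n=5
)
print(edges["c"], dependency_set("c", edges))
```

In this toy run the item `c` collides with `a` and then `b`, so its out-edges carry probe numbers 1 and 2, and the number of probes it used is its outdegree plus one.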

Definition 6: The dependency graph $G(x,D)$.

• The dependency graph of $x$ in $D$, $G(x,D)$, is the restriction of $G(D)$ to the vertex set comprising $x$ and all nodes reachable from $x$ in $G(D)$. Its edges and vertices are both labeled.

• The dependency set of $x$, $dep(x,D)$, is defined as the vertex set of $G(x,D)$.

It is also convenient to refine these objects based upon intermediate events.

Definition 7: Partial dependency graphs $G_r(x,D)$.

• The partial dependency graphs of $x_i$ are the $probe_i$ subgraphs of $G(x_i,D)$ in which $x_i$ is restricted to prefixes of its probe sequence: $G_0(x_i,D) \subset G_1(x_i,D) \subset \cdots \subset G_{probe_i - 1}(x_i,D) = G(x_i,D)$. $G_r(x_i,D)$ is composed of $x_i$, the edges corresponding to the first $r$ probes of $x_i$, and the restriction of $G(D)$ to the vertex set reachable from $x_i$ by these $r$ probes. We note that the graphs $G_r(x,D)$ and $G_{r+1}(x,D)$ might have the same vertex set and only differ in the outdegree of the root $x$ in the graph.

• The vertex set of $G_r(x,D)$ is denoted by $dep_r(x,D)$.

The set of all partial dependency graphs of $x$ is denoted by $\mathcal{G}^*(x,D)$. The vertices in each of these graphs have labels in $[1, |D|]$. We may relabel the vertices of any $G = (V,E)$, $G \in \mathcal{G}^*(x,D)$, in the unique order preserving way to $1, 2, \ldots, |V|$. Let $\mathcal{G}(x,D)$ be the resulting set of relabeled graphs. Clearly $|\mathcal{G}(x,D)| = |\mathcal{G}^*(x,D)|$.

These definitions provide immediate formulations for the expected number of probes to insert $x_j$, as a function of its dependency graph.

We will count the expected number of partial dependency DAGs rooted at $x_{\alpha n}$, which means that the root $x_{\alpha n}$ may not yet have found a vacant table slot for insertion. Thus the next probe, on behalf of $x_{\alpha n}$, will add another branch to the DAG if the new slot turns out to be occupied. Let $x_{\alpha n}$ have $r$ children in the DAG $G(x_{\alpha n}, D)$. Then it will have encountered $r+1$ DAGs. (The first will have zero children, since we do not require the root to be inserted when counting these structures.) Thus the number of such DAGs actually encountered by $x_{\alpha n}$ is precisely the number of probes needed to insert the key.

\[ E[probe_j] = \sum_{k \ge 0} \mathrm{Prob}\{\mathrm{dgr}(G(x_j,D)) \ge k\} = E\bigl[\,|\mathcal{G}(x_j,D)|\,\bigr]. \]
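
The first equality is the usual tail-sum identity. Spelled out, under the reading that $\mathrm{dgr}(G(x_j,D))$ denotes the outdegree of the root $x_j$ (one edge per unsuccessful probe):

```latex
\begin{align*}
E[\mathit{probe}_j]
  &= E\bigl[\mathrm{dgr}(G(x_j,D)) + 1\bigr] \\
  &= 1 + \sum_{k \ge 1} \mathrm{Prob}\{\mathrm{dgr}(G(x_j,D)) \ge k\} \\
  &= \sum_{k \ge 0} \mathrm{Prob}\{\mathrm{dgr}(G(x_j,D)) \ge k\},
\end{align*}
```

where the last step uses $\mathrm{Prob}\{\mathrm{dgr}(G(x_j,D)) \ge 0\} = 1$.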

To estimate the probability that a given labeled graph $G$ with $k$ vertices is in $\mathcal{G}(x_j, D)$, we can sample all subsequences $S$ of $(x_1, x_2, \ldots, x_{j-1})$ with $k-1$ vertices and map them as prescribed to the vertices of $G$. We may then evaluate the probability that the collision behavior of these vertices is exactly as prescribed, and that the chosen subsequence is the right one (i.e., the elements in $S$ end up in the same locations regardless of whether only $S$ or all of $D$ is hashed). The probability of the latter event is clearly the more difficult to estimate, since it concerns all of $D$. Moreover, estimating the probability that a sequence is the right one involves estimating the probability that certain locations are not occupied at certain times, which is our original problem. There is, however, one important difference: we may overestimate the probability that a sequence is the right one by underestimating the probability that the locations in which a candidate subsequence is embedded are full. The resulting expected number of such $S$ gives an overestimate for $E[probe_j]$.

A key to determining a window size (of subsequences to examine) is to find a minimal $h$ such that $\mathrm{Prob}\{|dep(x_{\alpha n}, D)| > h\} < 1/n^2$. Pursuant to this objective, we have the following.

Definition 8: The probabilities $P(k,j)$ and $\bar{P}(k,j)$.

• Let $P(k,j)$ be the probability that a partial dependency graph of $x_j$ (the $j$th item to be hashed) contains exactly $k$ vertices: $P(k,j) = \mathrm{Prob}\{|dep_r(x_j, D)| = k \text{ for some } r\}$.

• Let $\bar{P}(k,j)$ be the expected number of partial dependency graphs of $x_j$ that contain exactly $k$ vertices: $\bar{P}(k,j) = \sum_r \mathrm{Prob}\{|dep_r(x_j, D)| = k\}$.

The technical reason for defining $\bar{P}(k,j)$ as an expected number as opposed to a probability is that a single dependency graph may have two partial dependency graphs with the same number of vertices, due to a collision between $x_j$ and some node already within its dependency graph. Moreover, we now have the following formulation, which expresses $E[probe_j]$ as a function of its partial dependency graphs.

Lemma 1. $E[probe_j]$, the expected number of probes needed to insert the $j$th element $x_j$, equals $\sum_{0<k} \bar{P}(k,j)$.

Proof:

\[ \sum_{0<k} \bar{P}(k,j) = E\bigl[\,|\mathcal{G}(x_j,D)|\,\bigr]. \qquad \blacksquare \]

Unfortunately, $\bar{P}(k,j)$ may be a little unruly in $DH$, because of the possibility of reprobing earlier probe locations. Accordingly, we account for such events as follows.

Definition 9.

Let $Err_0(j)$ be the expected number of probes from $x_j$ to vertices already belonging to partial dependency sets of $x_j$: $Err_0(j) = \sum_{r \ge 2} \mathrm{Prob}\{dep_r(x_j, D) = dep_{r-1}(x_j, D)\}$.

Lemma 1 may now be restated as follows.

Corollary 1.

\[ E[probe_j] = Err_0(j) + \sum_{0<\ell} P(\ell,j). \qquad \blacksquare \]

Remark 1.

Note that the probabilities $P(\ell,j)$ (and, in fact, all performance statistics) are defined with respect to a universal class of hash functions. Now, these probabilities are not, of course, exactly the same for all classes in $DH_\psi$, or all classes in $DH$. However, since $UH \subset DH \subset DH_\psi$, Corollary 1 shows that $DH_\psi$ would be guaranteed to provide optimal performance if $P(\ell,j)$ and $Err_0(j)$ were shown to be asymptotically the same for all families in $DH_\psi$. In the following subsection, we establish that $P(k,k)$ is indeed essentially the same for all members of $DH_\psi$ when $\psi > 3k$.

2.1.1 The importance of $P(k,k)$

The value $P(k,k)$ is of special interest because the event $(|dep(x,D)| = k)$ corresponds to the existence of a subset $\delta \subset D$ of $k$ items, with $x \in \delta$, which has the collision behavior $|dep(x,\delta)| = k$. Consequently, $\delta$ is the dependency set $dep(x,D)$ for the data subset $D - \delta$. The event $dep(x,\delta) = \delta$, as the next lemma will show, is nearly independent of the hashing scheme and of the specific items being inserted. But before analyzing the probability distribution of dependency DAGs, we need a standard traversal procedure to extract unique spanning trees from each DAG.

Definition  10. 

Given a DAG $G = (V,E)$, let its spanning tree be constructed according to the following process. Its vertices are scanned in order of decreasing index in $D$. When a vertex $x$ is scanned, its children are immediately processed in order of decreasing probe count, so that the vertex in $x$'s first probe location is processed after all of its siblings. The tree edges out of $x$ will comprise the edges to $x$'s
previously unprocessed children.

Lemma 2. In $DH_\psi$, for $\psi > 3k$, $P(k,k)$ is within a factor of $(1 + O(k^3/n))$ of the same value, for all $k$-tuples of distinct items in $D$ and any family that satisfies the requirements of $DH_\psi$. More precisely, let $G$ be a dependency tree of $k$ vertices, and let $S = (S_1, S_2, \ldots, S_k)$ be a sequence of $k$ distinct elements in $U$. Then for $k = O(n^{1/3})$,

1) The probability that the dependency graph $G(S) = G$ under $DH_\psi$, for $\psi > 3k$, is $n^{-(k-1)}(1 + O(k^2/n))$.

2) The probability that $G(S)$ is a rooted dependency DAG that properly contains the structure $G$ as its spanning tree with root $S_k$, under $DH_\psi$, for $\psi > 3k$, is bounded by $O(k^3/n^k)$.

Proof: Let $G(S) = (V,E)$, where $|V| = k$. Let the sequence $S = (S_1, S_2, \ldots, S_k)$ be the order in which the vertices are processed as genuine children in our spanning procedure, with the root placed first. Let $Tr = (V, E_{Tr})$ be the tree discovered by the search. We embed the vertices in the order of exploration, root first.

1) Suppose that $G(S)$ is a tree, whence $E_{Tr} = E$. If the root $S_1$ has $h$ children, then $S_1$ is embedded in its $(h+1)$-st probe location, which is any one of $n$ locations. The $h$ children correspond to the $h$ items $S_1$ encountered when it was inserted. The probability that these first $h+1$ probes are to distinct locations is between $1$ and $1 - O(h^2/n)$. Subsequent node embedding will have two constraints: a node with $h$ children has its $(h+1)$-st probe location predetermined, and the first $h$ probes must be to $h$ distinct unembedded locations. Let $R_j$ be the set of locations used for the tree node destinations that are specified prior to the placement specification for the children of node $S_j$, and let $r_j$ be the location for $S_j$.

The probability that a node $S_j$ with $h_j$ children hashes to meet these two constraints is

\[ \mathrm{Prob}\Bigl\{ \bigl(p(S_j, h_j+1) = r_j\bigr) \wedge \bigwedge_{1 \le i < \ell \le h_j} \bigl(p(S_j,i) \ne p(S_j,\ell)\bigr) \wedge \bigwedge_{1 \le i \le h_j} \bigl(p(S_j,i) \notin R_j\bigr) \Bigr\}, \]

which is at most $\mathrm{Prob}\{p(S_j, h_j+1) = r_j\} \le \frac{1 + O(1/n)}{n}$, and at least

\[ \mathrm{Prob}\{p(S_j, h_j+1) = r_j\} - \sum_{1 \le i < \ell \le h_j} \mathrm{Prob}\{p(S_j, h_j+1) = r_j,\ p(S_j,i) = p(S_j,\ell)\} - \sum_{1 \le i \le h_j} \sum_{r \in R_j} \mathrm{Prob}\{p(S_j, h_j+1) = r_j,\ p(S_j,i) = r\}, \]

which is bounded below by $\frac{1}{n}\bigl(1 - O(h_j k/n)\bigr)$. We used our assumption of local robustness to derive the second term and pairwise independence to derive the third term. We appeal to the independence of individual probe sequences to multiply all $k$ factors to get a value between $(\frac{1}{n})^{k-1}(1 - O(k^2/n))$ and $(\frac{1}{n})^{k-1}(1 + O(k/n))$, which proves 1).

2) If $G(S)$ is not a tree, then $E \ne E_{Tr}$ and the nodes of $Tr$ have different embeddings, since collisions occurred. The tree construction is similar, but some nodes $x \in Tr$ will have gaps in their probe sequences $p(x,1), p(x,2), \ldots$ to their tree children, since edges to nodes that are already embedded or that have embedding locations already specified will be omitted. Now, the initial probe sequences for any $k$ items are mutually independent, as long as the total number of probes is bounded by $\psi$. Consequently, the probability that $V$ hashes into a DAG that yields $E_{Tr}$ as its spanning tree is at most $\prod_{j=1}^{k} p_{Tr,j}$, where $p_{Tr,j}$ overestimates the probability that the $j$-th vertex is hashed to have the correct probes to previously determined locations.

Let $S_j$ have $h_j$ tree edges. To upper bound $p_{Tr,j}$, we distinguish among three cases: $S_j$ has no non-tree edges; $S_j$ has at least one and fewer than $h_j + 2$ non-tree edges; and $S_j$ has at least $h_j + 2$ non-tree edges. Note that if no location can be probed twice in a probe sequence, as is the case in double hashing, then cases two and three combine into the case that $S_j$ has at least one and at most $k - 1$ non-tree edges.

The first case is as the overestimate in 1), and contributes a probability of at most $1$ to $p_{Tr,1}$, and at most $\frac{1}{n}(1 + O(1/n))$ to $p_{Tr,j}$, for $j > 1$.

In the second case, there are different DAG structures, depending on which probe count within $(h_j + 2, \ldots, 2h_j + 2)$ is the last and actually embeds $S_j$. Summing over all possible last probe counts, over the possible probe counts that correspond to the first non-tree edge, which is among the first $h_j + 1$ probes, and over the set of possible destinations for this first non-tree edge (which must be to a location already probed by $S_j$ or some other item in $S$), we get $O((h_j+1)(h_j+1)k)/n^2$ as an overestimate for the probability contributed to $p_{Tr,j}$ by case 2.

In the third case, there must be two consecutive non-tree edges among the first $2h_j + 2$ probes of $S_j$. These edges may go to previously embedded items or collide with earlier probes of $S_j$. To estimate this contribution to $p_{Tr,j}$, we ignore the requirement that $S_j$ must be placed successfully and focus on the expected number of ways a first pair of such probes could occur, which is bounded by $(2h_j + 1)\,O(k^2)/n^2$.

Combining like terms from the three cases into factors and multiplying gives

\[ \prod_{1 \le j \le k} p_{Tr,j} = \Bigl(\frac{1}{n}\Bigr)^{k-1} \prod_{1 \le j \le k} \Bigl(1 + O\Bigl(\frac{(h_j+1)k^2}{n}\Bigr)\Bigr) = \Bigl(\frac{1}{n}\Bigr)^{k-1} \bigl(1 + O(k^3/n)\bigr), \]

and hence the probability that $G$ results from the traversal of a non-tree DAG is at most

\[ \Bigl(\frac{1}{n}\Bigr)^{k-1}\bigl(1 + O(k^3/n)\bigr) - \Bigl(\frac{1}{n}\Bigr)^{k-1}\bigl(1 - O(k^2/n)\bigr) = O(k^3/n^k). \qquad \blacksquare \]

Notice that we have used the pairwise independence of probes, the independence of probes for $k$ different items, and the local robustness requirements that restrict an item's probe sequence from excessive reprobing of previously tried locations, and that we have assumed that the insertion procedure did not fail. The total number of probes, which governs the independence $\psi$ as defined in Definition 4, is less than $3k$. Even sharper bounds can be attained (more naturally) for true double hashing and for uniform hashing, but such results cannot improve our asymptotic efficiency results.

In view of Lemma 2, we need to examine the hash statistics associated with trees in greater detail. Accordingly, we have the following definitions.

Definition  11. 

• Let $N(k)$ be the number of distinct ordered (dependency) trees that can occur with $k$ vertices.

• Let $P_{tree}(k,j)$ be the probability that some partial dependency graph of $x_j$ is a tree of $k$ nodes.

Remark 2.

Lemma 2 shows that for any $k$-element subset $\delta = (\delta_1, \ldots, \delta_k) \subset D$, where $k = O(n^{1/3})$,

(a) $P(k,k) = \mathrm{Prob}\{dep(\delta_k, \delta) = \delta\} = n^{-k+1} N(k)(1 + O(k^3/n))$,

(b) $P_{tree}(k,k) = n^{-k+1} N(k)(1 + O(k^2/n))$.

The next step is to formalize these remarks and to introduce $Err(k,j)$, a bound that will replace $Err_0(j)$ in the formula of Corollary 1 and include a truncation error that permits the $P(\ell,j)$ to be summed through the first $k$ terms only.

Definition 12.

• Redefine $P(k,k)$ to be $n^{-k+1} N(k)(1 + O(k^3/n))$.

• Let $P_{tree}(k,k) = n^{-k+1} N(k)(1 + O(k^2/n))$.

• Let $Err_1(k,j)$ be the probability that the vertex count $|dep(x_j, D)| > k$.

• Let $RR(k,j)$ be the Boolean indicator function for the event ($|dep(x_j, D)| \le k$ and some unsuccessful probe for $x_j$ does not increase the size of its partial dependency set):

$RR(k,j) = (|dep(x_j, D)| \le k) \wedge (dep_r(x_j, D) = dep_{r+1}(x_j, D)$ for some $r \le probe_j - 2)$.

Let $Err_2(k,j) = 2E\bigl[\,|dep(x_j, D)| \times RR(k,j)\,\bigr]$, so that we take a penalty of $2|dep(x_j, D)|$ probes when $x_j$ has a dependency set of size $k$ or less, $x_j$ has a directed edge to $x_\ell$, and the indegree of $x_\ell$ is greater than 1 in $G(x_j, D)$.

• Let $Err_3(k,j)$ be the probability that $|dep(x_j, D)| \le k$ and $x_j$ has at least $2|dep(x_j, D)|$ probes to $dep(x_j, D)$.

• Let $Err_4(j) = \sum_{t \ge c_0 n} \mathrm{Prob}\{probe_j \ge t\}$, where $c_0$ is used in Definition 4 for $DH$ and $DH_\psi$.

• Let $Err(k,j) = c_0 n\,Err_1(k,j) + Err_2(k,j) + c_0 n\,Err_3(k,j) + Err_4(j)$.

It is easy to see that $Err_4(j) = \sum_{t \ge c_0 n} (t + 1 - c_0 n)\,\mathrm{Prob}\{probe_j = t\}$, so that this error is the expected number of probes beyond $c_0 n - 1$. The expected number of excess probes among the first $c_0 n$ is overcounted by the three other terms comprising $Err(k,j)$.

Corollary 2. For $\psi > 3k = O(n^{1/3})$,

\[ E[probe_j] \le Err(k,j) + \sum_{1 \le i \le k} P(i,j). \]

Proof: In view of Lemma 1, we need only show that $Err_0(j) + \sum_{i > k} P(i,j) \le Err(k,j)$. But this follows from the definition of $Err(k,j)$. ∎

The following definitions will enable us to formulate $P(k, \alpha n)$ as a function of $P(k,k)$.

Definition 13.

Let $I$ be a sequence of $k$ distinct locations in our hash table. Let $T$ be an increasing sequence of $k$ indices, $T_j < \alpha n$, with corresponding items $D_T$ in the ordered data set $D$.

• Define $loc_D(D_T)$ to be the sequence of table indices occupied by $D_T$ when $D$ is hashed into $L$. Let, for a data sequence $D'$, $loc(D') = loc_{D'}(D')$, so that $loc$ without a subscript takes its argument to be the complete sequence being hashed.

• Let $M(T,I)$ be the event: all of $D$ can be successfully hashed into $L$ according to the following modified hashing process: for $j \notin T$, $x_j$ is hashed according to its specified probe sequence; for $j = 1, 2, \ldots, |T|$, $x_{T_j}$ can be placed in the (formerly vacant) location $L[I_j]$ without concern for the probe sequence. If some $L[I_j]$ turns out to be already occupied at time $T_j$, then $M(T,I)$ does not occur.

Thus $M(T,I)$ depends on $I$, $T$ and $D - D_T$, but is independent of the values comprising $D_T$. Simply stated, $(loc_D(D_T) = I) = ((loc(D_T) = I) \wedge M(T,I))$. In models $UH$ and $DH$ the events $(loc(D_T) = I)$ and $M(T,I)$ are independent, although they depend on $I$.

• Let $q(T,I)$ denote the probability of $M(T,I)$. As noted earlier, we have not shown yet that the probabilities $q(T,I)$ for different families in $DH_\psi$ are very close.

• Given any sequence $\delta$, let $\delta \| x$ denote the sequence $\delta$ with $x$ appended at the end.
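
The modified hashing process behind the event $M(T,I)$ can be sketched as a simulation. The function below is a hedged illustration: the argument layout, the caller-supplied `probe(x, j)` function, and the `max_probes` guard (which treats an item that finds no vacancy as a failure of the event) are assumptions made for the example.

```python
def event_M(keys, probe, n, T, I, max_probes=None):
    """Return True if the modified hashing process succeeds.

    keys[t - 1] is the item inserted at time t. T is an increasing list of
    1-based insertion times and I the list of table locations forced on the
    items at those times. Items at times outside T follow their own probes.
    """
    forced = dict(zip(T, I))
    limit = max_probes if max_probes is not None else 10 * n
    table = [None] * n
    for t, x in enumerate(keys, start=1):
        if t in forced:
            loc = forced[t]
            if table[loc] is not None:
                return False  # forced location already occupied at time t
            table[loc] = x
            continue
        for j in range(1, limit + 1):
            loc = probe(x, j)
            if table[loc] is None:
                table[loc] = x
                break
        else:
            return False  # item could not be hashed successfully
    return True


# Toy data (hypothetical): item "b" at time 2 is placed by fiat.
seqs = {"a": [0], "b": [0, 1], "c": [2]}
probe = lambda x, j: seqs[x][j - 1]
print(event_M(["a", "b", "c"], probe, n=4, T=[2], I=[1]))  # location 1 is vacant
print(event_M(["a", "b", "c"], probe, n=4, T=[2], I=[0]))  # location 0 holds "a"
```

Note that the probe sequence of the forced item is never consulted, which mirrors the remark that $M(T,I)$ does not depend on the hash values of $D_T$.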

These  definitions  can  now  be  put  to  use  to  find  additional  formulations  for  the  expected  number 
of  probes. 


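Lemma 3.

In $UH$, for any fixed $I_0 \subset [0, n-1]$ with $|I_0| = k-1$:

\[ P(k, \alpha n) = P(k,k) \sum_{\substack{T \subset [1, \alpha n - 1] \\ |T| = k-1}} q(T, I_0). \tag{2} \]

In $DH$:

\[ P(k, \alpha n) \le P(k,k) \sum_{\substack{T \subset [1, \alpha n - 1] \\ |T| = k-1}} \ \max_{\substack{I \subset [0, n-1] \\ |I| = k-1}} q(T,I). \tag{3} \]

In $DH_\psi$, for $3k < \psi$:

\begin{align}
P(k, \alpha n) &\le P(k,k) \sum_{\substack{T \subset [1, \alpha n - 1] \\ |T| = k-1}} \ \max_{\substack{I \subset [0, n-1] \\ |I| = k-1}} \mathrm{Prob}\bigl\{ M(T,I) \bigm| (dep(x_{\alpha n}, D_{T \| \alpha n}) = D_{T \| \alpha n}) \wedge (loc(D_T) = I) \bigr\} \tag{4} \\
&\le P(k,k) \sum_{\substack{T \subset [1, \alpha n - 1] \\ |T| = k-1}} \ \max_{\substack{I \subset [0, n-1] \\ |I| = k-1}} q_{\psi - 3k}(T,I), \tag{5}
\end{align}

where the subscript in the expression $q_{\psi - 3k}$ in (5) is intended to restrict numerical computation of $q$ to inclusion-exclusion calculations that use no more than $(\psi - 3k)$-wise independence.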

Proof: In all three models:

\begin{align*}
P(k, \alpha n) &= \sum_{|T| = k-1} \sum_{|I| = k-1} \mathrm{Prob}\bigl\{ (dep(x_{\alpha n}, D_{T \| \alpha n}) = D_{T \| \alpha n}) \wedge (loc(D_T) = I) \wedge M(T,I) \bigr\} \\
&\le \sum_{|T| = k-1} \Bigl( \Bigl( \sum_{|I| = k-1} \mathrm{Prob}\bigl\{ (dep(x_{\alpha n}, D_{T \| \alpha n}) = D_{T \| \alpha n}) \wedge (loc(D_T) = I) \bigr\} \Bigr) \\
&\qquad \times \max_{|I| = k-1} \mathrm{Prob}\bigl\{ M(T,I) \bigm| (dep(x_{\alpha n}, D_{T \| \alpha n}) = D_{T \| \alpha n}),\ (loc(D_T) = I) \bigr\} \Bigr).
\end{align*}

Now $\sum_{|I| = k-1} \mathrm{Prob}\{(dep(x_{\alpha n}, D_{T \| \alpha n}) = D_{T \| \alpha n}) \wedge (loc(D_T) = I)\} = \mathrm{Prob}\{dep(x_{\alpha n}, D_{T \| \alpha n}) = D_{T \| \alpha n}\}$. We know (Lemma 2 and Remark 2) that in all three models, $\mathrm{Prob}\{dep(x_{\alpha n}, D_{T \| \alpha n}) = D_{T \| \alpha n}\} = P(k,k)$, for all $D_{T \| \alpha n}$, and the same (asymptotic) equality holds if we add the restriction that the dependency graph has no more than $3k$ probes. Inequality (4) now follows.

In $UH$, the event $M(T,I)$ is independent of $(dep(x_{\alpha n}, D_{T \| \alpha n}) = D_{T \| \alpha n}) \wedge (loc(D_T) = I)$, and is uniformly distributed over all sets $I$ that comprise $k-1$ elements. Hence (2) follows with equality.

In $DH$, $M(T,I)$ depends on the locations $I$, but for any fixed $I$ it is also independent of $\{(dep(x_{\alpha n}, D_{T \| \alpha n}) = D_{T \| \alpha n}) \wedge (loc(D_T) = I)\}$, since hash values on $D_T$ do not disclose any information about hash values on $D - D_T$. Thus (3) follows.

In $DH_\psi$, large events are not necessarily independent, but our estimates of the conditional $M(T,I)$ will be based on windows of $\psi - 3|D_T|$ probe events for keys in $D - D_T$, conditioned on information about the hash function's behavior on $D_T$. Since these events are independent of the conditioning, the numerical estimate in (5) now follows. ∎

Further analysis of $P(k,k)$ will show that $P(k, \alpha n)$ is negligible for $k > C \log n$, for a suitable constant $C$. Similarly, $Err(k, \alpha n)$ will turn out to be negligible. As a consequence, the behavior of the hash function can be analyzed by determining $P(k,k)$ for the very light loads $k = O(\log n)$ and estimating $q(T,I)$ based on small samples of points in $D - D_T$, for small $|T| = O(\log n)$.

3.  Good  vacancy  estimators  and  their  generic  performance  equation 

Suppose  that  T  is  a  sequence  of  insertion  times  for  a  collection  of  keys  that  locally  hash  into  a 
dependency  graph.  This  dependency  graph  is  important  if  its  apparent  hash  locations  are  really 
empty  at  respective  times  T.  We  need  a  function  q(T )  that  overestimates  the  probability,  in  DH, 
that  these  |T|  locations  are  empty  at  times  T. 

Definition  14:  Multiplicative  vacancy  overestimators. 

• We call a Boolean function $M'(T,I)$ a vacancy overestimator if, for any sequence of table locations $I$ and key indices $T$, the event $M(T,I)$ (that locations $I$ are empty at respective times $T$) implies the event $M'(T,I)$.

• We call a function $q(t)$ a multiplicative vacancy overestimator for an event $M'(T,I)$ if $q$ is decreasing and for any fixed $\alpha < 1$, some bound $k$, and for all $D : |D| = \alpha n$, $T \subset [1, \alpha n] : |T| \le k$, $I \subset [1, n] : |I| = |T|$, the following holds:

\[ \max \mathrm{Prob}\bigl\{ M'(T,I) \bigm| [dep(x_{\alpha n}) = D_T] \wedge [loc(D_T) = I] \bigr\} \le \Bigl(1 + \frac{O(|T|^2)}{n}\Bigr) \prod_{t=1}^{|T|} q(T_t). \]

We could have defined weaker multiplicative overestimators that have a correction factor of $(1 + O(|T|^p)/n)$, for some fixed $p > 2$ instead of $p = 2$, and our asymptotic results, it turns out, would be unchanged; this extra freedom, however, appears to be unnecessary.

Of course, any multiplicative overestimator for an event $M'$ that overestimates $M$ is also a multiplicative overestimator for $M$. We shall eventually take the bound $|T| \le k$ to be proportional to $\log n$, but shall adopt the expedient of leaving its value unspecified as long as possible. In any case, $(1 + O(1/n))q(t)$ is an overestimate of the probability that a given table location is empty after $t - 1$ items have been entered (since we may choose $T = \{t\}$). Moreover, such $q$'s do exist: $q(t) = 1$ is certainly a (very uninteresting) multiplicative overestimator. Given a multiplicative estimator $q$ to overestimate the probability that a given dependency graph hashes into empty locations at the times specified by $T$, Lemma 1, Lemma 2 and Corollary 2 show that the expected probe count for $x_j$ can be overestimated by a computation that is virtually identical in $UH$, $DH$, and $DH_\psi$, provided that for
a  suitable  k  =  O(logn),  Err(k,j )  =  — so  that  the  computation  can  be  restricted  to  dependency 
trees of size $k$ or less. For $DH_\psi$ the independence $\psi$ will be required to exceed $3k + s$, where the $s$-wise (conditional) independence of the probe sequences must guarantee that the behavior of $q$ is
as  stated.  Until  these  values  are  quantified,  we  shall  expose  the  implicit  dependencies  by  writing  qs 
and  S|/|  when  appropriate. 

The presentation of a suitably multiplicative overestimator $q$ is technical, and will be deferred even further. Meanwhile, the reader may prefer to view the following development as if it were for $UH$, although the conclusions will apply to $DH$ and $DH_\psi$ as well.


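Corollary 3. Let $q_s$ be a multiplicative vacancy overestimator for $UH$, $DH$, or $DH_\psi$, where $sk + 3k < \psi$, and put $\frac{1}{n} \sum_{0 < b < a} q_s(b) = Q_s(a)$. Then for $k = O(n^{1/3})$,

\begin{align}
P_{tree}(k, \alpha n) &\le \Bigl(1 + \frac{O(k^2)}{n}\Bigr) \sum_{\substack{|T| = k-1 \\ T \subset [1, \alpha n - 1]}} P_{tree}(k,k) \prod_{i=1}^{k-1} q_s(T_i) \tag{6} \\
&\le \Bigl(1 + \frac{O(k^2)}{n}\Bigr) P_{tree}(k,k)\, \frac{n^{k-1}\, Q_s(\alpha n)^{k-1}}{(k-1)!}. \tag{7}
\end{align}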

Proof: Lemmas 2 and 3 show that

\[ P_{tree}(k, \alpha n) \le \Bigl(1 + \frac{O(k^2)}{n}\Bigr) P_{tree}(k,k) \sum_{\substack{T \subset [1, \alpha n - 1] \\ |T| = k-1}} \max_I q_s(T,I). \]

Inequality (6) follows from Definition 14, which requires that $(1 + O(|T|^2)/n) \prod_{i=1}^{k-1} q_s(T_i)$ overestimate $q_s(T,I)$. Similarly, (7) is an immediate consequence, since

\[ \frac{1}{(k-1)!} \Bigl[ \sum_{t=1}^{\alpha n - 1} q_s(t) \Bigr]^{k-1} \]

includes, for all subsequences $T \subset [1, \alpha n]$ with $|T| = k-1$, each of the products $\prod_i q_s(T_i)$ exactly once. ∎

Corollary 3 provides a way to estimate $P_{tree}(k, \alpha n)$ and hence $E[probe_{\alpha n}]$ from a vacancy estimator $q_s(t)$.
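
The counting step behind (7) can be checked numerically on a small instance. The sketch below compares the sum of products over increasing sequences $T$ with the power-sum bound divided by $(k-1)!$. The particular choice $q(t) = 1 - (t-1)/n$ is an assumption made only for the demonstration, and the function names are illustrative.

```python
from itertools import combinations
from math import factorial, prod


def sum_over_increasing(q, m, size):
    """Sum of prod(q(t) for t in T) over increasing T in [1, m] with |T| = size."""
    return sum(prod(q(t) for t in T) for T in combinations(range(1, m + 1), size))


def power_sum_bound(q, m, size):
    """(1 / size!) * (q(1) + ... + q(m)) ** size."""
    return sum(q(t) for t in range(1, m + 1)) ** size / factorial(size)


n = 20                           # toy table size
q = lambda t: 1 - (t - 1) / n    # assumed vacancy estimate, for illustration
m, size = 9, 3                   # T ranges over [1, alpha*n - 1] with alpha = 1/2, k = 4

lhs = sum_over_increasing(q, m, size)
rhs = power_sum_bound(q, m, size)
print(lhs, rhs, lhs <= rhs)
```

Expanding the power produces every ordered tuple of indices, so each increasing sequence is counted $(k-1)!$ times, and the remaining tuples (those with repeated indices) only add nonnegative terms.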


Theorem 1. Let $q_s$ be a multiplicative vacancy estimator for $UH$, $DH$, or $DH_\psi$, where $sk + 3k < \psi$ and $k = O(n^{1/3})$. Let $\frac{1}{n} \sum_{0 < b < a} q_s(b) = Q_s(a)$. Then

\[ E[probe_{\alpha n}] \le \frac{1}{\sqrt{1 - 2Q_s(\alpha n)}} + O\Bigl(\frac{1}{n}\Bigr) + Err(k, \alpha n). \]

Proof: By Lemma 2,

$$P_{\mathrm{tree}}(k,k) = \frac{N(k)}{n^{k-1}}\bigl(1 + O(k^2/n)\bigr), \qquad (8)$$

where $N(k)$ counts the number of different dependency trees with the $k$ vertices $x_1, x_2, \ldots, x_k$. To
determine $N(k)$, we observe that the root of the dependency trees is fixed at $x_k$, the vertex with
highest index. Both the rightmost subtree of $x_k$ as well as the tree consisting of root $x_k$ plus the
remaining vertices comprising its other subtrees, if any, constitute partial dependency trees. If the
rightmost subtree contains $j$ vertices, its elements can be chosen in $\binom{k-1}{j}$ ways. This results in the
following recurrence equation for $N$:

$$N(1) = 1, \qquad N(k) = \sum_{1 \le j \le k-1} \binom{k-1}{j}\, N(j)\, N(k-j), \quad k > 1,$$

which upon setting $j' = k - j$ gives:

$$N(k) = \sum_{1 \le j' \le k-1} \binom{k-1}{j'-1}\, N(j')\, N(k-j').$$

Averaging these two formulations, and applying the equality $\binom{k-1}{j-1} + \binom{k-1}{j} = \binom{k}{j}$, gives:

$$N(1) = 1, \qquad N(k) = \frac{1}{2} \sum_{1 \le j \le k-1} \binom{k}{j}\, N(j)\, N(k-j), \quad k > 1. \qquad (9)$$

Let

$$g(x) = \sum_{0 < k} \frac{N(k)}{k!}\, x^k. \qquad (10)$$

Then multiplying (9) by $x^k/k!$, summing, and applying (10) gives:

$$g(x) = x + \frac{1}{2} \sum_{k > 1}\ \sum_{1 \le j \le k-1} \frac{N(j)}{j!}\,\frac{N(k-j)}{(k-j)!}\, x^k = g^2(x)/2 + x,$$

and hence

$$g(x) = 1 - \sqrt{1 - 2x} = 1 - \sum_{k \ge 0} \binom{1/2}{k} (-2x)^k. \qquad (11)$$

Equating coefficients of $x^k$ in (10) and (11) gives:

$$N(k) = k!\binom{1/2}{k}\, 2^k (-1)^{k-1} = (k-1)!\binom{-1/2}{k-1} (-2)^{k-1}.$$

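As a sanity check on this closed form, the following sketch (an illustration added here, not part of the original derivation) evaluates recurrence (9) directly and compares it with $(k-1)!\binom{-1/2}{k-1}(-2)^{k-1}$. The helper names `n_trees`, `gen_binom`, and `n_closed` are hypothetical and introduced only for this check.

```python
from fractions import Fraction
from math import comb, factorial


def n_trees(k_max):
    """Return [N(1), ..., N(k_max)] from N(k) = 1/2 * sum_j C(k, j) N(j) N(k-j)."""
    n = [0, 1]  # n[0] is padding so that n[k] == N(k)
    for k in range(2, k_max + 1):
        total = sum(comb(k, j) * n[j] * n[k - j] for j in range(1, k))
        n.append(total // 2)  # the symmetric sum is always even
    return n[1:]


def gen_binom(a, m):
    """Generalized binomial coefficient C(a, m) for rational a and integer m >= 0."""
    out = Fraction(1)
    for i in range(m):
        out = out * (a - i) / (i + 1)
    return out


def n_closed(k):
    """Closed form N(k) = (k-1)! * C(-1/2, k-1) * (-2)^(k-1), as an exact fraction."""
    return factorial(k - 1) * gen_binom(Fraction(-1, 2), k - 1) * (-2) ** (k - 1)
```

Under these assumptions the first values are $1, 1, 3, 15, 105, 945$, that is, the double factorials $(2k-3)!!$.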
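Substituting for $N(k)$ in (8), and using Corollary 3 to define $P(k, \alpha n)$ in terms of $P_{\mathrm{tree}}(k,k)$ and
$Q_s(\alpha n)$ gives:

$$P(k, \alpha n) \le \Bigl(1 + \frac{O(k^3)}{n}\Bigr) \frac{N(k)}{(k-1)!}\, Q_s(\alpha n)^{k-1} = \binom{-1/2}{k-1} \Bigl(1 + \frac{O(k^3)}{n}\Bigr) \bigl(-2Q_s(\alpha n)\bigr)^{k-1}. \qquad (12)$$

Summing the $P(i, \alpha n)$, as prescribed by Corollary 2, gives:

$$
\begin{aligned}
E[\mathit{probe}_{\alpha n}] &\le \sum_{i=1}^{k} \Bigl(1 + \frac{O(i^3)}{n}\Bigr) \binom{-1/2}{i-1} \bigl(-2Q_s(\alpha n)\bigr)^{i-1} + \mathrm{Err}(k, \alpha n) \\
&\le \frac{1}{\sqrt{1 - 2Q_s(\alpha n)}} + \sum_{i=1}^{k} \frac{O(i^3)}{n} \binom{-1/2}{i-1} \bigl(-2Q_s(\alpha n)\bigr)^{i-1} + \mathrm{Err}(k, \alpha n). && (13)
\end{aligned}
$$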

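The error attributable to non-tree DAGs is seen to be bounded by:

$$\sum_{i} \frac{O(i^3)}{n} \binom{-1/2}{i-1} \bigl(-2Q_s(\alpha n)\bigr)^{i-1} \le \sum_{i \ge 4} \frac{O\bigl(Q_s(\alpha n)^3\bigr)}{n} \binom{-7/2}{i-4} \bigl(-2Q_s(\alpha n)\bigr)^{i-4} = O\Bigl(\frac{Q_s(\alpha n)^3}{n\,(1 - 2Q_s(\alpha n))^{7/2}}\Bigr). \qquad (14)$$

We conclude that for fixed $\alpha < 1$, and $2Q_s(\alpha n)$ bounded by some fixed value less than 1,

$$E[\mathit{probe}_{\alpha n}] \le \frac{1}{\sqrt{1 - 2Q_s(\alpha n)}} + O\Bigl(\frac{1}{n}\Bigr) + \mathrm{Err}(k, \alpha n). \qquad ∎$$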

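It is worth remarking that the results and computations are monotone in $q_s$; any error in its
estimation carries through in $Q_s$.

It is also reassuring to observe that Theorem 1 gives the correct performance bound for UH.
In UH, the probability that a location is vacant at time $\alpha n$ is $1 - \alpha + \frac{1}{n}$, and
$\mathrm{Prob}(M(T, I)) \le \prod_{j} \bigl(1 - \frac{T_j - 1}{n}\bigr)$, whence $q(t) = 1 - \frac{t-1}{n} = \bigl(1 - \frac{t}{n}\bigr)\bigl(1 + \frac{O(1)}{n}\bigr)$ is a multiplicative vacancy overestimator.

In this case, $Q(\alpha n + 1) = \frac{1}{n}\sum_{t=1}^{\alpha n} \bigl(1 - \frac{t-1}{n}\bigr) = (\alpha - \alpha^2/2)\bigl(1 + \frac{O(1)}{n}\bigr)$.

Theorem 1 says that for UH,

$$
\begin{aligned}
E[\mathit{probe}_{\alpha n}] &\le \frac{1}{\sqrt{1 - 2(\alpha - \alpha^2/2)(1 + O(|I|)/n)}} + O\Bigl(\frac{1}{n}\Bigr) + \mathrm{Err}(|I|, \alpha n) \\
&\le \frac{1}{\sqrt{1 - 2(\alpha - \alpha^2/2)}} + O\Bigl(\frac{|I|}{n}\Bigr) + \mathrm{Err}(|I|, \alpha n) \\
&= \frac{1}{1 - \alpha} + O\Bigl(\frac{|I|}{n}\Bigr) + \mathrm{Err}(|I|, \alpha n).
\end{aligned}
$$

In fact, we ought to observe that

$$E[\mathit{probe}_{\alpha n}] \le \frac{1}{1 - \alpha} + O\Bigl(\frac{1}{n}\Bigr) + \mathrm{Err}(k, \alpha n).$$

This sharper bound follows by noting that $Q(t) = \bigl(\frac{t}{n} - \frac{1}{2}(\frac{t}{n})^2\bigr)\bigl(1 + \frac{O(j)}{n}\bigr)$ for a partial dependency
set of size $j$. Substituting this formulation into (12) gives a sum of error terms in (13) that are of
the same form as (14).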
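The step from the binomial series to the closed form $1/\sqrt{1 - 2Q}$, and its specialization to $1/(1-\alpha)$ in UH, can be checked numerically. The sketch below is an illustration under the assumption $0 \le 2Q < 1$; the function names are hypothetical, and the $O(\cdot)/n$ and $\mathrm{Err}$ terms are ignored.

```python
from math import sqrt


def probe_series(q, terms=200):
    """Partial sum of sum_{k>=1} C(-1/2, k-1) * (-2q)^(k-1), assuming 0 <= 2q < 1."""
    total, term = 0.0, 1.0  # term holds C(-1/2, m) * (-2q)^m for m = k - 1
    for m in range(terms):
        total += term
        # Ratio C(-1/2, m+1) / C(-1/2, m) equals (-1/2 - m) / (m + 1).
        term *= (-0.5 - m) / (m + 1) * (-2.0 * q)
    return total


def closed_form(q):
    """Leading term of the Theorem 1 bound, 1 / sqrt(1 - 2q)."""
    return 1.0 / sqrt(1.0 - 2.0 * q)


def q_uniform(alpha):
    """Leading term of Q(alpha * n) in uniform hashing: alpha - alpha^2 / 2."""
    return alpha - alpha * alpha / 2.0
```

With $Q = \alpha - \alpha^2/2$ one has $1 - 2Q = (1 - \alpha)^2$, so the series sums to $1/(1-\alpha)$, the classical uniform hashing insertion cost.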

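We now use Theorem 1 to bound $\mathrm{Err}(k, \alpha n)$.


Corollary 4. Let $q$ be a multiplicative vacancy estimator, in DH and $DH_\psi$, for $sk + 3k < \psi$. Set
$2Q(\alpha n) = \frac{2}{n}\sum_{0 \le j < \alpha n} q(j)$, and suppose that $2Q(\alpha n) < \beta < 1$. Then $\mathrm{Err}(k, \alpha n) = O\bigl(\frac{1}{n}\bigr)$ for
$k \ge 3\log_{1/\beta} n$, and hence for such $k$,

$$E[\mathit{probe}_{\alpha n}] \le \frac{1}{\sqrt{1 - 2Q(\alpha n)}} + O\Bigl(\frac{1}{n}\Bigr).$$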


Proof:

1) Recall that $\mathrm{Err}_1(k, j)$ is the probability that the vertex count $|dep(x_j, D)| > k$. We use $c_0 n$
as an overestimate of the first $c_0 n$ or fewer probes that occur, in this case, and show that
$c_0 n\,\mathrm{Err}_1(k, \alpha n) = O(\frac{1}{n})$ for suitable $k$. So suppose that $|dep(x_{\alpha n}, D)| > k$. Then some $x \in D$
has a partial dependency set of size $\hat{k}$, where $k < \hat{k} \le 2k$. Indeed, let $x_t$ be the first key
in $D$ to have a dependency set of size $k + 1$ or more. Then each child of $x_t$ in $G(x_t, D)$
can have a dependency set of size $k$ or less. Sequencing over $G_1(x_t, D), G_2(x_t, D), \ldots$ gives a
family of dependency graphs with vertex counts growing by steps of $k$ or less, and eventually
exceeding $k$. Hence one of these counts must be within $[k + 1, 2k]$. It follows that

$$\mathrm{Err}_1(k, \alpha n) \le \sum_{i \le \alpha n}\ \sum_{k < j \le 2k} P(j, i) \le \alpha n \sum_{k < j \le 2k} P(j, \alpha n).$$

The proof of Theorem 1 (inequality (12)) allows us to bound this sum as follows.

$$
\begin{aligned}
c_0 n\,\mathrm{Err}_1(k, \alpha n) &\le \alpha c_0 n^2 \sum_{j=k+1}^{2k} \bigl(1 + O(j^3/n)\bigr) \binom{-1/2}{j-1} \bigl(-2Q(\alpha n)\bigr)^{j-1} \\
&\le O(n^2) \sum_{j > k} \beta^j \prod_{i=1}^{j} \Bigl(1 - \frac{1}{2i}\Bigr) \\
&\le O(n^2) \sum_{j > k} \beta^j \prod_{i=1}^{j} e^{-1/(2i)} \\
&\le O(n^2) \sum_{j > k} \beta^j e^{-(\log j)/2} \\
&\le O(n^2) \sum_{j > k} \frac{\beta^j}{\sqrt{j}} \\
&\le O\Bigl(\frac{n^2 \beta^k}{\sqrt{k}\,(1 - \beta)}\Bigr). && (15)
\end{aligned}
$$

Taking $k = 3\log_{1/\beta} n$ gives an additive error of $O(\frac{1}{n})$.

2) Recall that $\mathrm{Err}_2(\alpha n, k)$ equals $2|dep(x_{\alpha n}, D)|$ in the case that $|dep(x_j, D)| \le k$ and $x_j$ has some
unsuccessful probe that does not increase its partial dependency set. Let $|dep(x_{\alpha n}, D)| = j$.
There are at most $j$ probes of $x_{\alpha n}$ that could be the first to revisit a dependency set. There
are at most $j - 1$ vertices that could be the probe's destination. Let the destination node be $z$.
Key $z$ can be reached by a direct probe edge from $x_{\alpha n}$, and by some earlier path from $x_{\alpha n}$ that
comprises one or more edges. Let the node probed by $x_{\alpha n}$ along this path be $w$. If $w = z$, we
have a constraint that two specific probes of $x_{\alpha n}$ are the same, which occurs with probability
$\frac{1}{n}$ for each of the $\binom{j}{2}$ or fewer possibilities. Otherwise, we compute the probability that
the DAG structure hashed as specified by a traversal that begins with $w$, reaches all of its
descendants, and then continues from $x_{\alpha n}$. The embedding of $w$ is unconstrained, but $x_{\alpha n}$ will
be constrained at two probe locations.


The  expected  number  of  ways  these  events  can  occur,  in  this  case,  is  7V(j)(^)  1  +  0(^-)). 

and  hence  Err2(cm,  k)  <  Y?j=i  N(j) 

3) Recall that $\mathrm{Err}_3(k, \alpha n)$ is the probability that $|dep(x_{\alpha n}, D)| = j \le k$ and $x_{\alpha n}$ has at least $2j$
probes to $dep(x_{\alpha n}, D)$. The probability that the dependency set $G(x_{\alpha n}, D)$ occurs as stated with
$j$ nodes is bounded by $(1 + O(j^3/n))\,N(j)\,P(j, j)\,2j^3/n^{j+1}$, since $x_{\alpha n}$ must have two consecutive
probes among the first $2j$ that visit locations previously probed by $x_{\alpha n}$. We take $c_0 n$ as an
estimate for the number of probes in this case.

4) Recall that $\mathrm{Err}_4(\alpha n)$ is the expected number of probes of length $c_0 n$ or more needed to
insert $x_{\alpha n}$. Then the expected number of probes contributed in this case is
$c_0 n\,\mathrm{Prob}\{\text{at least } c_0 n \text{ probes occur}\} + \sum_{t > c_0 n} \mathrm{Prob}\{\text{at least } t \text{ probes occur}\}$. The first term is already counted
by $c_0 n\,\mathrm{Err}_3 + c_0 n\,\mathrm{Err}_1$. The second term is bounded by $O(\frac{1}{n})$ according to our robustness
requirement that long probe sequences be rare.

5) Recall that $\mathrm{Err}(k, j) = c_0 n\,\mathrm{Err}_1(k, j) + \mathrm{Err}_2(k, j) + c_0 n\,\mathrm{Err}_3(k, j) + \mathrm{Err}_4(j)$. Taking
$k = 3\log_{1/\beta} n$ gives an additive error of $O(\frac{1}{n})$ for $c_0 n\,\mathrm{Err}_1$. $\mathrm{Err}_4 = O(1/n)$ by the global robustness
requirement. As for $\mathrm{Err}_2 + c_0 n\,\mathrm{Err}_3$, summing these error terms over the range of dependency
set sizes $j$ gives a formulation that is equivalent to (14), and hence $O(\frac{1}{n})$ in size. ∎
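The chain of estimates in part 1) of the proof rests on three elementary facts: $\prod_{i \le j}(1 - \frac{1}{2i}) \le 1/\sqrt{j}$, a geometric tail bound, and the choice $k = 3\log_{1/\beta} n$, which drives $n^2\beta^k$ down to $1/n$. The sketch below checks these numerically for sample values; the function names and the sample parameters ($\beta = 0.75$, $n = 10^6$) are illustrative assumptions, and the hidden $O(\cdot)$ constants are ignored.

```python
from math import ceil, log, sqrt


def half_binom(j):
    """|C(-1/2, j)| written as the product prod_{i=1..j} (1 - 1/(2i))."""
    out = 1.0
    for i in range(1, j + 1):
        out *= 1.0 - 1.0 / (2 * i)
    return out


def tail_sum(beta, k, terms=5000):
    """Truncated tail sum_{j>k} beta^j / sqrt(j)."""
    return sum(beta ** j / sqrt(j) for j in range(k + 1, k + 1 + terms))


def tail_bound(beta, k):
    """Right-hand side of the last step, without the O(n^2) factor."""
    return beta ** k / (sqrt(k) * (1.0 - beta))


def k_for(beta, n):
    """Smallest integer k >= 3 * log_{1/beta}(n), so that beta^k <= n^-3."""
    return ceil(3.0 * log(n) / log(1.0 / beta))
```

For example, `k_for(0.75, 10**6)` is a few hundred at most, comfortably within the range $k = O(n^{1/3})$ required by Theorem 1 for tables of that size.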

We  are  now  ready  to  identify  our  vacancy  estimator. 

Definition 15: The vacancy criterion $M^{(h)}(T, I)$ and its probability $q^{(h)}(T, I)$.

• Let $M^{(h)}(T, I)$ be the vacancy criterion: for $j = 1, 2, \ldots, |I|$, no tuple $S \subset \{x_1, \ldots, x_{T_j - 1}\} - D_T$
of size $|S| \le h$ hashes into a dependency tree $G$ rooted at location $I_j$.

• Let $q^{(h)}(T, I)$ denote the probability that the vacancy criterion $M^{(h)}(T, I)$ holds.

Thus, $M^{(h)}(T, I)$ is a vacancy criterion with limited backtracking. The criterion deems a location
$\ell$ to be occupied by time $t$ if some witness subsequence $S$ (comprising $h$ or fewer items among
the first $t - 1$ elements) hashes locally into a dependency tree rooted at $\ell$. Otherwise $\ell$ is deemed
vacant. It should be noted that a witness sequence $S$ may not represent the dependency graph
actually rooted at $\ell$. Moreover, it turns out that we will not actually determine $q^{(h)}(T, I)$; instead,
we will estimate its value with moderate accuracy. For our calculations, the vacancy estimator will
be virtually unaffected by the assumptions of limited independence, as well as the specific hashing
model, but will be strong enough to give good hashing bounds when used in Theorem 1. We also
note that $M^{(h)}(T, I)$ excludes small dependency sets that hash into a location of $I$ if the structure is
not a tree or if the structure uses more than one probe to locations in $I$. These overestimates of
the vacancy will turn out to be asymptotically negligible.


Lemma 4. For any fixed $\alpha < 1$ and fixed $h$, in uniform hashing,

$$q^{(h)}(\alpha n) \le 1 - \alpha + \frac{(2\alpha - \alpha^2)^{h+1}}{\sqrt{h+1}\,(1 - \alpha)^2} + \frac{O(1)}{n}.$$


Proof: If the event $M^{(h)}(T_j, I_j)$ occurs, then either $I_j$ is vacant at time $T_j$ or the true dependency
graph rooted at $I_j$ has at least $h + 1$ vertices. The probability of the event $M^{(h)}(T_j, I_j)$ is
therefore bounded by the probability that $I_j$ is empty at time $T_j$ plus the probability that the true
dependency graph rooted at location $I_j$, at time $T_j$, is a DAG with $h + 1$ or more vertices.

Such a dependency graph differs from the dependency graphs we have analyzed so far in just
two respects. First, the root is required to be embedded at the fixed location $I_j$. Second, the root
could be any key in $(x_1, x_2, \ldots, x_{T_j - 1})$.

Let $P_{loc\_j}(k, T_j)$ be the probability that the true dependency graph rooted at location $I_j$ by
time $T_j$ is a DAG with $k$ nodes. It is easy to see that the computation for $P_{loc\_j}(k, T_j)$ is very
similar to that for $P(k, T_j)$. The reasoning of Lemmas 2 and 3 gives the following.

$$
\begin{aligned}
P_{loc\_j}(k, \alpha n) &\le \Bigl(1 + \frac{O(k^3)}{n}\Bigr) \sum_{\substack{T \subset [1, \alpha n - 1] \\ |T| = k}} N(k)\, n^{-k} \prod_{i=1}^{k} q(T_i) \\
&\le \Bigl(1 + \frac{O(k^3)}{n}\Bigr) \frac{N(k)}{k!}\, Q(\alpha n)^k \\
&= -\binom{1/2}{k} \Bigl(1 + \frac{O(k^3)}{n}\Bigr) \bigl(-2Q(\alpha n)\bigr)^k. && (16)
\end{aligned}
$$

We have seen that in uniform hashing, UH, $2Q(\alpha n) < 2\alpha - \alpha^2$, since the sum for $Q$
includes fewer than $\alpha n$ terms. Hence for suitable $K$,

$$
\begin{aligned}
\mathrm{Prob}_{UH}\{M^{(h)}(\alpha n, I_{\alpha n})\} &\le (1 - \alpha) - \sum_{k=h+1}^{K} \Bigl(1 + \frac{O(k^3)}{n}\Bigr) \binom{1/2}{k} \bigl(-2Q(\alpha n)\bigr)^k + \frac{O(1)}{n} + \mathrm{Err}_1(K, \alpha n) \\
&\le (1 - \alpha) + \frac{(2\alpha - \alpha^2)^{h+1}}{\sqrt{h+1}\,(1 - \alpha)^2} + \frac{O(1)}{n}, \quad \text{by the estimate in (15).} \qquad ∎
\end{aligned}
$$

It is worth remarking that our vacancy estimator actually converges at a much faster rate (as
a function of $h$) than the estimate given by Lemma 4. The Lemma used a bound for the probability
that the true dependency graph rooted at a specific location has $h$ or fewer vertices, as opposed
to the probability that some witness DAG with $h$ or fewer vertices can certify that the location is
already occupied. In the case $h = 1$, for example, it is easy to verify that the probability that
a location will not have been hit by a first probe of $\alpha n$ items is $e^{-\alpha} + \frac{O(1)}{n}$, which is already
much smaller than the corresponding bound predicted by Lemma 4.
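The gap described in this remark is easy to see numerically. The sketch below assumes that the Lemma 4 bound has the form $1 - \alpha + (2\alpha - \alpha^2)^{h+1} / (\sqrt{h+1}\,(1-\alpha)^2)$ with the $O(1)/n$ term dropped, and compares it with $e^{-\alpha}$, the limit of $(1 - 1/n)^{\alpha n}$. Both function names are hypothetical.

```python
from math import exp, sqrt


def lemma4_bound(alpha, h):
    """Assumed form of the Lemma 4 bound, without its O(1)/n term."""
    beta = 2.0 * alpha - alpha * alpha
    return (1.0 - alpha) + beta ** (h + 1) / (sqrt(h + 1) * (1.0 - alpha) ** 2)


def first_probe_vacancy(alpha):
    """Limit of (1 - 1/n)^(alpha * n): no first probe of alpha*n items hits the cell."""
    return exp(-alpha)
```

At $\alpha = 0.5$ and $h = 1$ the assumed Lemma 4 expression exceeds 1, and is therefore vacuous as a probability bound, while $e^{-0.5} \approx 0.61$ already sits close to the true vacancy $1 - \alpha = 0.5$.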

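Assume for the moment that, as we will soon prove, $q^{(h)}(t)$ is a multiplicative vacancy
overestimator that satisfies $q^{(h)}(t) \le \bigl(1 - \frac{t}{n}\bigr) + \frac{(2\alpha - \alpha^2)^{h+1}}{\sqrt{h+1}\,(1 - \alpha)^2} + \frac{O(1)}{n}$. Then this estimator can be used with simple
error estimates from Taylor's series to establish the following theorem.


Theorem 2. Suppose that $q_s^{(h)}(t) = \bigl(1 - \frac{t}{n}\bigr) + \frac{(2\alpha - \alpha^2)^{h+1}}{\sqrt{h+1}\,(1 - \alpha)^2} + \frac{O(1)}{n}$ is a multiplicative overestimator
in $DH_\psi$, and hence $\mathrm{Err}(k, \alpha n) = O(1)/n$, for large enough $\psi = O(\log n)$. Let

$$2Q_s(\alpha n) = \frac{2}{n}\sum_{t < \alpha n} q_s^{(h)}(t) \le 2\alpha - \alpha^2 + \frac{2(2\alpha - \alpha^2)^{h+1}}{\sqrt{h+1}\,(1 - \alpha)^2},$$

and choose $h$ so that $2\alpha - \alpha^2 + \frac{2(2\alpha - \alpha^2)^{h+1}}{\sqrt{h+1}\,(1 - \alpha)^2}$ is closer to $2\alpha - \alpha^2$ than 1. Then

$$E[\mathit{probe}_{\alpha n}] \le \frac{1}{\sqrt{1 - 2Q_s(\alpha n)}} + \frac{O(1)}{n} \le \frac{1}{1 - \alpha} + O\Bigl(\frac{(2\alpha - \alpha^2)^{h+1}}{\sqrt{h+1}\,(1 - \alpha)^{5}}\Bigr) + \frac{O(1)}{n}. \qquad ∎$$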


This theorem is actually our main result. The remainder of this paper is solely aimed at showing
that the probability of $M^{(h)}(T, I)$, for any constant $h$, can be adequately estimated in DH, even
with limited independence. Accordingly, we will define a witness graph $WG^{(h)}(T, I)$ for locations $I$
and corresponding times $T$. Intuitively, the witness graph ought to contain all vertices (hash keys
from $D$) that could possibly belong to some local dependency tree that comprises $h$ or fewer vertices
and has its root located in $I$.

This witness graph is constructed in a greedy top-down manner, much as the construction of
dependency trees. Let $D' = (x'_1, \ldots, x'_{|D'|})$ be the sequence $(D - D_T)$ in reverse order. The witness
set will be found by scanning $D'$ to see, essentially, which items might wind up hitting relevant
items within their first $h$ probes. We call the locations of relevant items eligible collision points. If
an item hits an eligible collision point at its $k$-th probe, where $k \le h$, the item is inserted into the
witness set and its first $k - 1$ probe locations are inserted into the set of eligible collision points,
since these locations must be already occupied by the (real) time the key is actually hashed, if it is
to require $k$ probes for insertion. Then the next item is processed. The eligible collision set (i.e., set
of eligible collision points) is initialized to $I$. As this procedure suggests, the witness set is defined
without reference to the time constraints $T$; this simplification can only increase the number of
relevant keys and eligible collision points that are found.

We achieve better bounds by including the sum of the number of probes consumed by the
sequence of collisions that is responsible for the presence of each eligible collision location. If some
item takes $k$ probes to reach an eligible collision location, then only $h - k + 1$ probes are available
for an item that (at some earlier real insertion time) fills one of the first $k - 1$ probe locations in the
probe sequence. The following procedure provides a formal construction of the witness graph. The
eligible collision set is represented by the family $\mathcal{L}_i(\tau)$, $(i \in [0, h - 1])$, where $i$ is an underestimate
of the size of the dependency graph which led to the addition of items in $\mathcal{L}_i(\tau)$, and $\tau \in [0, |D'|]$
indicates that the locations in $\mathcal{L}_i(\tau)$ are available for collisions within the first $h - i$ probes of item
$x'_{\tau+1}$. Initially $\mathcal{L}_0(0) = I$ and $\mathcal{L}_i(0) = \emptyset$ for $i > 0$. After $x'_\tau$ is processed and hence $\mathcal{L}_i(\tau)$
determined, $\mathcal{L}_i(\tau + 1)$ is initialized to $\mathcal{L}_i(\tau)$ and additional locations might then be inserted into
$\mathcal{L}(\tau + 1)$, depending on the first $h$ probes of $x'_{\tau+1}$. In particular, if $x'_{\tau+1}$ hits a location in $\mathcal{L}_j(\tau + 1)$
on probe $i$, with $j + i \le h$, then $x'_{\tau+1}$ is inserted into the witness set $W$ and its first $i - 1$ probes are
inserted into $\mathcal{L}_{i+j-1}(\tau + 1)$.

This  graph  may  be  viewed  as  directed  and  bipartite,  with  keys  and  locations  as  vertices,  outgoing 
edges  from  a  key  to  locations  that  correspond  to  unsuccessful  probes,  and  an  incoming  edge  to  a 
key  from  the  eligible  hit  location  where  the  key  must  reside,  if  it  is  to  belong  to  some  dependency 
tree  of  size  h  or  less  with  root  in  I. 

Because the keys are processed in reverse order, with a greedy interpretation as to their eventual
hash location, the procedure will include many elements in the witness set that will turn out to be
irrelevant. Some items are assumed to reside in probe locations for which earlier probe locations
will turn out to be empty; for others, the dependency graph will turn out to be too big. On the
other hand, these circumstances will only be evident after the structure is completed and all items
are inserted.

It is easy to believe that the elements of any actual dependency graph of $h$ or fewer items that
is rooted in $I$ are included in the witness set. Actually, we are obliged to state what happens in the
unlikely event that several of an item's first $h$ probes hit eligible collision points. The reason that
this issue must be addressed is that once two probes are given specific values, the remaining probes
may be completely deterministic, in DH.

If an item in $D - D_T$ incurs multiple collisions, we elect to throw away the “witness”, and
not record the collisions. This produces a simpler witness graph that overestimates the probability
$q^{(h)}(T, I)$ that event $M^{(h)}(T, I)$ occurs, since some witnesses for $\neg M^{(h)}(T, I)$ will not have been
included in the graph. On the other hand, we may underestimate $q^{(h)}(T, I)$ as the probability that
the vacancy criterion $M^{(h)}(T, I)$ holds and no multiple collisions occur in the witness set.

Definition 16: The witness graph $WG^{(h)}(T, I) = (\mathcal{L}, W, E)$.

Let witness graph $WG^{(h)}(T, I) = (\mathcal{L}, W, E)$ be procedurally specified by the following algorithm.

1. $D' \leftarrow Reverse(D - D_T)$;

2. $\mathcal{L}_0(0) \leftarrow I$;

3. $W(0) \leftarrow \emptyset$;

4. for $i \leftarrow 1$ to $h - 1$ do $\mathcal{L}_i(0) \leftarrow \emptyset$ endfor;

5. for $\tau \leftarrow 1$ to $|D'|$ do

6. for $i \leftarrow 0$ to $h - 1$ do $\mathcal{L}_i(\tau) \leftarrow \mathcal{L}_i(\tau - 1)$ endfor;

7. $W(\tau) \leftarrow W(\tau - 1)$;

8. if for exactly one pair $(i, j)$, $i \in [1, h]$, $j \le h - i$ : $p(x'_\tau, i) \in \mathcal{L}_j(\tau - 1)$ then

{ A single collision occurs at probe location $p(x'_\tau, i)$. }

9. $\ell \leftarrow p(x'_\tau, i)$;

10. $\mathcal{L}_{i+j-1}(\tau) \leftarrow \mathcal{L}_{i+j-1}(\tau) \cup_{k<i} \{p(x'_\tau, k)\}$;

11. $W(\tau) \leftarrow W(\tau) \cup \{x'_\tau\}$;

12. $E \leftarrow E \cup (\ell, x'_\tau)$ labeled $i$;

13. $E \leftarrow E \cup_{k<i} \{(x'_\tau, p(x'_\tau, k))\}$ labeled $k$

14. endif

15. endfor;

16. $W \leftarrow W(|D'|)$;

17. $\mathcal{L} \leftarrow \bigcup_i \mathcal{L}_i(|D'|)$;

18. Replace all location labels referencing indices in $\mathcal{L} - I$ by pointers to abstract vertices.
This procedure enables us to compare the probabilities that a given witness graph structure
will occur in UH, DH, and $DH_\psi$. (See Lemma 5.) To establish an asymptotic equivalence among
all three models, it is essential that the actual probe locations apart from those hitting $I$ be absent
from the structure (line 18). All that is recorded in the structure is which probes collided with
which. To show that the vacancy estimates are almost the same in the three models, we need, in
part, estimates to bound, with high probability, the size of a witness graph as a function of the
dependency set size $|I|$. Lemma 6 gives a crude (and simple) bound for the expected size of the
witness graph, and shows that the witness graph is proportional to $|I|$ with sufficient probability.
Lemma 7 gives a better bound on the size of witness sets and thereby establishes a better bound
for the independence $\psi$.
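The procedure of Definition 16 can be sketched in executable form as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the probe sequences are supplied explicitly as lists, keys are identified by their position in $D - D_T$, the function name `build_witness_graph` and the `("loc", ...)` / `("key", ...)` vertex tags are hypothetical, and step 18 (replacing concrete locations by abstract vertices) is omitted.

```python
def build_witness_graph(probe_seqs, roots, h):
    """Greedy witness-graph construction, following steps 1-17 of Definition 16.

    probe_seqs[t] lists the first h probe locations of the t-th key of D - D_T
    (in insertion order); roots is the location set I.
    Returns (levels, witnesses, edges).
    """
    levels = [set(roots)] + [set() for _ in range(h - 1)]  # levels[j] stands for L_j
    witnesses = []  # W, stored as indices into probe_seqs
    edges = []      # E, stored as (source, target, label) triples
    for idx in reversed(range(len(probe_seqs))):  # scan D' = Reverse(D - D_T)
        probes = probe_seqs[idx]
        # Step 8: every pair (i, j) with i in [1, h], j <= h - i, p(x, i) in L_j.
        hits = [(i, j)
                for i in range(1, h + 1)
                for j in range(h - i + 1)
                if probes[i - 1] in levels[j]]
        if len(hits) != 1:
            continue  # no collision, or a multiple collision: nothing is recorded
        i, j = hits[0]
        earlier = probes[:i - 1]             # the first i - 1 probe locations
        levels[i + j - 1].update(earlier)    # step 10
        witnesses.append(idx)                # step 11
        edges.append((("loc", probes[i - 1]), ("key", idx), i))  # step 12
        edges.extend((("key", idx), ("loc", p), k)               # step 13
                     for k, p in enumerate(earlier, start=1))
    return levels, witnesses, edges
```

For instance, with $h = 2$, $I = \{0\}$ and probe sequences `[[2, 9], [2, 0], [0, 3], [7, 8]]`, the last key misses, the third key hits location 0 on its first probe, the second key hits location 0 on its second probe (so location 2 becomes eligible at level 1), and the first key then hits location 2 on its first probe. Updating the level sets in place after the test in step 8 is equivalent to copying $\mathcal{L}_i(\tau - 1)$ into $\mathcal{L}_i(\tau)$ first, since the test only consults the sets as they stood before the current key.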

Let $W \subset D'$ be a candidate witness set of $k$ keys with $W = \{x'_{\tau_1}, \ldots, x'_{\tau_k}\}$. Let $WG$ be a candidate
labeled witness graph for the pair $(T, I)$ with key vertex set $W$, and (abstract) eligible location set
$\mathcal{L}$, with $\mathcal{L}_i(\tau)$ as defined in the formal procedure. Recall that by construction, any such $WG$ is a
forest, when edges are viewed as being undirected; there will be no cycles.

Definition 17.

• Let $L(\tau) = \sum_{i=0}^{h-1} (h - i)\,|\mathcal{L}_i(\tau)|$.

• Let $\mathrm{Prob}^{W}_{UH}\{WG\}$, $\mathrm{Prob}^{W}_{DH}\{WG\}$, $\mathrm{Prob}^{W}_{DH_\psi}\{WG\}$ be the respective probabilities that $WG$ is the
actual witness graph for the pair $(T, I)$ in UH, DH and $DH_\psi$.

• Let $\mathrm{Prob}^{\mathrm{pure}}_{UH}\{WG\}$, $\mathrm{Prob}^{\mathrm{pure}}_{DH}\{WG\}$, $\mathrm{Prob}^{\mathrm{pure}}_{DH_\psi}\{WG\}$ be the respective probabilities that $WG$ is
the witness graph for the pair $(T, I)$ and that no vertices were eliminated in its construction
due to double hits.

We suppress, for notational simplicity, the implicit dependence of these probabilities on sets
$D$, $T$, and $I$. We shall use the notation $WG$ in two contexts. When $WG$ is selected from a set of
candidate witness sets, $WG$ will represent a sample set of keys and a hash structure; here we will
compute the probability that $WG$ is the actual witness set that occurs. When $WG$ is not bound as
a candidate, it will represent the actual witness set; here the computational issue is the probability
that its size $|WG|$ is extremely large. Due to the multiplicity of hashing models, it is convenient to
extend Definition 17 as follows.

Definition 18.

• Let $\mathrm{Prob}^{W}\{WG\}$ and $\mathrm{Prob}^{\mathrm{pure}}\{WG\}$ be used in expressions that hold for each of the three
models.

• Let $\mathrm{Prob}^{W}_{\bullet}\{WG\}$ and $\mathrm{Prob}^{\mathrm{pure}}_{\bullet}\{WG\}$ be used in expressions where $\bullet$ is a free subscript that
holds when all $\bullet$-s are simultaneously replaced by UH, DH, or $DH_\psi$.

Lemma 5. Let $WG$ be a candidate witness graph with $L = O(n^{1/2})$. Then

1) $\mathrm{Prob}^{W}_{DH}\{WG\} = \mathrm{Prob}^{W}_{UH}\{WG\}\,(1 + O(|\mathcal{L}|^2)/n)$.

2) $\mathrm{Prob}^{W}_{DH_\psi}\{WG\} = \mathrm{Prob}^{W}_{DH}\{WG\}\,(1 + \epsilon e^{-D})$, for $\psi \ge h|W| + 6L + 3|I| + D$, and some $\epsilon$: $|\epsilon| \le 1$.

3) $\mathrm{Prob}^{\mathrm{pure}}_{\bullet}\{WG\} = \mathrm{Prob}^{W}_{\bullet}\{WG\}\,\bigl(1 + O(|WG|^2/n)\bigr)$ for UH, DH, and $DH_\psi$, for $\psi \ge h|W| + 6L + 3|I| + D$.

Proof: The counting statistics for witness sets are similar to those for dependency graphs, but
differ from the latter in two aspects. The simplest change is that witness sets have embedded roots
(in locations of $I$). The other difference is that, unlike the dependency sets of Theorem 1, which are
based entirely on local properties within (windows of) $k$ vertices, witness graphs (forests) are global
structures selected from the large subsequence $D - D_T$ and locations $I$. Consequently, the probability
that a forest $WG$ is the witness forest in question involves both local hashing properties and the
event that the many items not included in $WG$ either do not hit any of the eligible hit locations
belonging to $WG$ or hit the location set more than once. Most importantly, these probabilities turn
out to be nearly identical in our three models.

Definition 19.

• Let $\mathrm{Prob}^{\mathrm{loc\text{-}hit}}_{UH}\{WG\}$ be the probability that the local set comprising the keys $W \subset D$ hit the
prescribed virtual locations in the manner prescribed by $WG$, and with no additional collisions
among the eligible probes of keys in $W$.

• Let $\mathrm{Prob}^{\mathrm{not1\text{-}hit}}_{UH}\{WG\}$ be the probability that vertices in $D - D_T - W$ incur either no hit or
multiple hits to the eligible location set within the requisite number of probes, as they are
processed by the algorithm.

• Let $\mathrm{Prob}^{\mathrm{no\text{-}hit}}_{UH}\{WG\}$ be the probability that $D - D_T - W$ do not hit the eligible location set at
all, within the requisite number of probes.

We extend these definitions to models DH and $DH_\psi$.

Clearly $\mathrm{Prob}^{W}_{UH}\{WG\} = \mathrm{Prob}^{\mathrm{loc\text{-}hit}}_{UH}\{WG\} \times \mathrm{Prob}^{\mathrm{not1\text{-}hit}}_{UH}\{WG\}$. In DH, such a simple
formulation is not quite true, since $\mathrm{Prob}^{\mathrm{loc\text{-}hit}}_{DH}\{WG\}$ and $\mathrm{Prob}^{\mathrm{not1\text{-}hit}}_{DH}\{WG\}$ will depend, somewhat, on
just which actual locations are used for each possible embedding of $WG$. Now, $\mathrm{Prob}^{\mathrm{loc\text{-}hit}}\{WG\}$
is (in UH, or DH) between $(1/n - O(h|\mathcal{L}|/n^2))^k$ and $\bigl(\frac{1}{n - O(h)}\bigr)^k$, where $k$ is the number of keys
belonging to $WG$. The extra factor of $\frac{1}{n}$ comes from the fact that unlike dependency sets, the
roots in witness forests are explicitly embedded in specific locations. In UH, one could evaluate
this probability precisely as a function of the sizes $|\mathcal{L}_i(\tau)|$, for a given witness graph. Such an
evaluation, however, is unnecessary, since it suffices to show that the computations in the three
models are virtually identical. Given a specific embedding of the candidate $WG$ structure, the
second factor in $\mathrm{Prob}^{W}_{UH}\{WG\}$ is readily written as

$$\mathrm{Prob}^{\mathrm{not1\text{-}hit}}\{WG\} = \mathrm{Prob}\Bigl\{\bigwedge_{\tau:\ x'_\tau \in D' - W} \neg s(\tau)\Bigr\},$$

where $s(\tau)$ is the event that $x'_\tau \in D'$ experiences a single hit with respect to the embedded
sets $\mathcal{L}_0(\tau - 1), \ldots, \mathcal{L}_{h-1}(\tau - 1)$. Let $\sigma(\tau, WG)$, with appropriate subscript UH, DH and $DH_\psi$,
denote $\mathrm{Prob}\{s(\tau)\}$. Since, for any specific embedding of $WG$, the events $s(\tau)$ are mutually
independent in both DH and UH, $\mathrm{Prob}^{\mathrm{not1\text{-}hit}}\{WG\} = \prod_{\tau:\ x'_\tau \in D' - W} (1 - \sigma(\tau, WG))$ in these two
models. It is easy to see that for all $\tau$, both $\sigma_{DH}(\tau, WG)$ and $\sigma_{UH}(\tau, WG)$ are, up to
factors of $(1 + O(1/n))$, between $\frac{L(\tau - 1)}{n} - O\bigl(\frac{L(\tau - 1)^2}{n^2}\bigr)$ and $\frac{L(\tau - 1)}{n}$. Hence
$1 - \sigma_{DH}(\tau, WG) = (1 - \sigma_{UH}(\tau, WG))\bigl(1 + O(|\mathcal{L}(\tau)|^2/n^2)\bigr)$ for any embedding of the structure, and
any constant $h$. $\mathrm{Prob}^{\mathrm{not1\text{-}hit}}\{WG\}$ is therefore the same in UH and DH to within a factor of
$(1 + O(|\mathcal{L}(\tau)|^2/n^2))^{\tau} = (1 + O(|\mathcal{L}(\tau)|^2/n))$. In $DH_\psi$ the events $s(\tau)$ are only $(\psi - h|W| - 3|I|)$-wise
independent, but $\sigma_{DH_\psi}(\tau, WG) = \sigma_{DH}(\tau, WG)$ (as long as $\psi > h|W| + L + 3|I|$). Lemma A2.2 in
the Appendix shows that in this case,

$$\bigl|\mathrm{Prob}^{\mathrm{not1\text{-}hit}}_{DH_\psi}\{WG\} - \mathrm{Prob}^{\mathrm{not1\text{-}hit}}_{DH}\{WG\}\bigr| \le \mathrm{Prob}^{\mathrm{not1\text{-}hit}}_{DH}\{WG\}\, e^{-D} \quad \text{if } \psi - h|W| - L - 3|I| \ge 5L + D,$$

where $L$ is used as an overestimate for the expectation $\sum_{\tau \in D'} \sigma_{DH}(\tau, WG)$. Consequently,

1) $\mathrm{Prob}^{W}_{DH}\{WG\} = \mathrm{Prob}^{W}_{UH}\{WG\}\,(1 + O(|\mathcal{L}|^2/n))$, and

2) $\mathrm{Prob}^{W}_{DH_\psi}\{WG\} = \mathrm{Prob}^{W}_{DH}\{WG\}\,(1 + O(e^{-D}))$, for $\psi \ge h|W| + 6L + 3|I| + D$.

It is easy to verify 3). Let $\mathrm{Prob}^{W}_{D' - x'_\tau}\{WG\}$ denote the computation for $\mathrm{Prob}^{W}\{WG\}$ over the set $D' - x'_\tau$,
where $D' = D - D_T - W$. Then

ProbYiWG]  -  ProH.UTe  {WG}  <  Prob^,^  {WG}  Q  =  Prob™ {WG}0( 


\WGf 


n 


where  the  factor  (2)  is  an  estimate  of  the  probability  that  x'T1  has  two  or  more  eligible  probes 

into  C.  | 


We can now show that $q^{(h)}(T, I)$, the probability that our vacancy criterion holds for locations
$I$, can be successfully approximated in all three models by the probability that these locations are
declared empty by our witness graphs. It will follow that the probabilities $q^{(h)}_{UH}(T, I)$, $q^{(h)}_{DH}(T, I)$,
$q^{(h)}_{DH_\psi}(T, I)$ differ by a factor of at most $\bigl(1 + \frac{O(|I|^2)}{n}\bigr)$ in the three models, for an appropriate choice
of $\psi$. Lemmas 6 and 7 both establish this equivalence.


Lemma 6. Let $T = (T_1, \ldots, T_{|T|})$ be a sequence of increasing time stamps with $T_{|T|} = \alpha n$, and let
$I$ be an arbitrary sequence of distinct table locations with $|I| = |T|$. In addition, let the following
definitions hold.


Definition 20.

• Let the random structure $WG^{(h)}(T, I) = (\mathcal{L}, W, E)$ be the witness graph for $D$, $T$, and $I$.

• Define the random variable $L(t) = \sum_{i=0}^{h-1} (h - i)\,|\mathcal{L}_i(t)|$.

• Let $w(T, I)$, with appropriate subscript UH, DH or $DH_\psi$, be the probability that
$WG^{(h)}(T, I)$ contains no witnesses, and hence “declares” each location $I_j$ to be empty at
insertion time $T_j$.

Then for $|I| = o(n^{1/3})$,

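1) $w_{DH}(T, I) \le w_{UH}(T, I)\bigl(1 + \frac{O(|I|^2)}{n}\bigr)$;

$w_{DH_\psi}(T, I) \le w_{DH}(T, I)(1 + e^{-D}) + \mathrm{Prob}\{|\mathcal{L}| > K\}$, for $\psi \ge 3|I| + 7hK + D$.

2) For $\psi \ge 14(h + 2)!\bigl(|I|\log\frac{1}{1 - \alpha} + \log h + \log n\bigr)$,

$w_{\bullet}(T, I) = q^{(h)}_{\bullet}(T, I)\bigl(1 + \frac{O(|I|^2)}{n}\bigr)$, in UH, DH, and $DH_\psi$,

and hence witness graphs formalize a good vacancy criterion for all three models.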

Proof: It is convenient to analyze the construction of the witness forest as if each collision
$p(x'_t, i) \in \mathcal{L}_j(t - 1)$ were the outcome of a Bernoulli trial with probability of success $|\mathcal{L}_j(t - 1)|/n$.
Thus each time step is viewed as contributing $\binom{h+1}{2}$ (somewhat dependent) Bernoulli trials, of which at
most one may result in success. This simplification will have a few insignificant consequences, which
we are obliged to acknowledge. The simplest is that the requirement that at most one of the trials
be successful can be ignored, since we are interested in establishing upper bounds on the number
of successes. The other is that a single (real) probe can hit at most one of the disjoint sets $\mathcal{L}_j$, and
modeling the outcome with $h$ different Boolean trials undercounts the probability that at least one
success occurs, since the conditional probability that $\mathcal{L}_j$ is probed, given that $\mathcal{L}_0, \ldots, \mathcal{L}_{j-1}$ are not
probed, can be increased by a factor of about $\frac{n}{n - |\mathcal{L}|}$. The simplest resolution of this problem
is to include this factor in our model implicitly, by replacing the table size $n$ by the parameter
$n_1$ and recasting the probabilities in terms of $n_1$, which will be adjusted a posteriori. We follow
this prescription, although technical arguments can establish that no such rescaling is actually
necessary.

Let  Af(i,  t)  =  E[|£j(f)|],  We  overestimate  the  probability  that  (p{%'t,j -i  + 1)  e  £,•(<- 1)  is  the  only 
hit  of  x't)  as  the  outcome  of  a  Bernoulli  trial  with  probability  of  success  equal  to  .  We  may 

ignore  the  time  restriction  on  locations  in  I  as  prescribed  by  the  algorithm,  and  form  the  system 
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for  Af  as  follows: 

W(0,  t)  =  |/|,  and  A f{j,  0)  =  0,  j  =  1, 2, . .. ,  h  -  1; 

1)  +  J2U  ~  1)M-  (17) 

i<j 

A  gross  overestimate  of  E[£y(cm)]  (and  Af(j,an))  is  given  by  the  system: 

W,(0,  t)  =  |/|,  A -g(j,  0)  =  0,  j  =  1, 2, ,  . . ,  h  -  1; 

Xuij.t) -Xyij.t  1  )  -../^Ayf /.o// )///,.  '  l  '! 

i<j 

In  terms  of  the  mndom  vnrmble  nitty  define  ct  stochastic  dommntor  ^ 

with  E[£j(f)]  =  Ng(j,t )  as  follows.  Let  •  be  the  outcome  of  t  independent  Bernoulli  trials 
with  probabilities  of  success  o  C'\ ,  where  £o(f)  =  7.  We  now  prove  by  induction  on  *  that  our 
overestimate  of  Yli<j  £i(an)  (and  hence  J2t<  j  |£j(cm)||  is  bounded  by  B j  =  iT+Hll wit h  probability 
1  _  Je-5o/2,  for  any  choice  of  Bq  y  |Lj.  It  sufhces  to  set,  for  simplicity  and  additional  overcount, 
cm  =  nj .  Clearly  £o(cm)  <  7?o  with  probability  1.  In  general,  the  probability  that  Yli<j  £i{an)  >  Bj , 
subject  to  the  condition  that  Ekj_i  A'(an)  <  is  bounded  by  the  probability  that  £j(an)  > 

$B_j - B_{j-1}$, which can be rewritten as $\mathcal{L}_j(\alpha n)/j > (1 + 1/j)B_{j-1}$, since $B_j = (j+2)B_{j-1}$. The quantity $\mathcal{L}_j(\alpha n)/j$ is
bounded by the number of successes in $n_1$ Bernoulli trials with probability of success $B_{j-1}/n_1$. A
standard Chernoff-Hoeffding bound for the sum $X$ of independent Bernoulli trials with expectation
$E[X]$ is $\mathrm{Prob}\{X > (1+\epsilon)E[X]\} < e^{-\epsilon^2 E[X]/3}$, for $0 < \epsilon < 1$. This estimate shows that for $t \le n_1$, the
probability that $\mathcal{L}_j(t)/j$ exceeds $(1 + 1/j)B_{j-1}$ is bounded by $e^{-B_{j-1}/(3j^2)}$, which is at most $e^{-B_0/4}$ for
all values $j \ge 1$. Consequently, the probability that $\sum_{i \le j} \mathcal{L}_i(\alpha n) > B_j$ is bounded by $j e^{-B_0/4}$.
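
The claim that the exponent never drops below $B_0/4$ can be checked with exact arithmetic. The following is a minimal sketch, assuming the reading $B_j = (j+2)!\,B_0/2$ (which follows from $B_j = (j+2)B_{j-1}$) and a Chernoff exponent of $B_{j-1}/(3j^2)$; the function name `exponent_ratio` is illustrative and not from the paper.

```python
from fractions import Fraction
from math import factorial


def exponent_ratio(j: int) -> Fraction:
    """Return (B_{j-1} / (3 j^2)) / B_0, assuming B_j = (j + 2)! * B_0 / 2."""
    b_prev_over_b0 = Fraction(factorial(j + 1), 2)  # B_{j-1} / B_0
    return b_prev_over_b0 / (3 * j * j)


# Tabulate the exponent (in units of B_0) for the first few levels j.
ratios = {j: exponent_ratio(j) for j in range(1, 13)}
worst_j = min(ratios, key=ratios.get)  # level with the weakest tail bound
```

Under these assumptions the weakest level is $j = 2$, where the exponent is exactly $B_0/4$; every other level gives a strictly larger exponent.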

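Let $|\mathcal{L}| = |\mathcal{L}(\alpha n)| = \sum_{i \le h-1} |\mathcal{L}_i(\alpha n)|$. We have shown that

$$\mathrm{Prob}\{|\mathcal{L}| > (h+1)!\,B_0/2\} < h\,e^{-B_0/4}, \quad \text{for any } B_0 \ge |I|.$$

Let $c_\alpha = -2\log(1-\alpha)$. Choosing $B_0/2 = |I|c_\alpha + D$, we see that

$$\mathrm{Prob}\{|\mathcal{L}| > (h+1)!\,(|I|c_\alpha + D)\} < h\,e^{-(|I|c_\alpha + D)/2} \le h(1-\alpha)^{|I|}e^{-D/2}. \qquad (19)$$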
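Furthermore,

$$\sum_{k > c_\alpha|I|(h+1)!} \mathrm{Prob}\{|\mathcal{L}| = k\}\,k^2 \le \sum_{k > c_\alpha|I|(h+1)!} k^2\,h(1-\alpha)^{|I|}\,e^{-\frac{k - c_\alpha|I|(h+1)!}{2(h+1)!}}$$

$$\le h(1-\alpha)^{|I|}\,(8c_\alpha|I|(h+1)!)^2 \sum_{k \ge 0} \left(\frac{1}{8} + \frac{k}{8c_\alpha|I|(h+1)!}\right)^2 e^{-\frac{k}{2(h+1)!}}$$

$$\le h(1-\alpha)^{|I|}\,(8c_\alpha|I|(h+1)!)^2 \sum_{k \ge 0} \left(1 + \frac{k}{8c_\alpha|I|(h+1)!}\right)^2 e^{-\frac{k}{2(h+1)!}}$$

$$\le h(1-\alpha)^{|I|}\,(8c_\alpha|I|(h+1)!)^2 \sum_{k \ge 0} e^{\frac{2k}{8c_\alpha|I|(h+1)!} - \frac{k}{2(h+1)!}}$$

$$\le h(1-\alpha)^{|I|}\,(8c_\alpha|I|(h+1)!)^2 \sum_{k \ge 0} e^{-\frac{k}{4(h+1)!}}$$

$$\le h(1-\alpha)^{|I|}\,(8c_\alpha|I|(h+1)!)^2\,\frac{1}{1 - e^{-\frac{1}{4(h+1)!}}}$$

$$\le h(1-\alpha)^{|I|}\,(8c_\alpha|I|)^2\,5\,((h+1)!)^3 \qquad (20)$$

$$= |I|^2(1-\alpha)^{|I|}\,O(1). \qquad (21)$$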
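
The step that replaces the geometric series by a multiple of $(h+1)!$ can be sanity-checked numerically. This is a sketch under the assumption that the step in question reads $1/(1 - e^{-1/(4(h+1)!)}) \le 5(h+1)!$; the helper name `geometric_tail` is hypothetical.

```python
import math


def geometric_tail(h: int) -> float:
    """Value of sum_{k>=0} exp(-k / (4 (h+1)!)) = 1 / (1 - exp(-1 / (4 (h+1)!)))."""
    x = 1.0 / (4 * math.factorial(h + 1))
    return 1.0 / (-math.expm1(-x))  # -expm1(-x) equals 1 - exp(-x), computed stably


# Ratio of the series to the claimed bound 5 (h+1)!; values below 1 confirm the step.
checks = {h: geometric_tail(h) / (5 * math.factorial(h + 1)) for h in range(1, 10)}
```

Under this reading the ratio approaches 4/5 from above as $h$ grows, so the constant 5 leaves a modest amount of slack.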


We can now show that the probability that the witness graph $WG^{(h)}(T,I)$ contains no occupancy
witnesses for any of the locations is identical in UH and DH, up to a factor of $(1 + O(|I|^2/n))$.
Evidently, $w^{(h)}(T,I)$ can be expressed as the sum, over all legal candidate witness graphs that
declare locations $I$ empty, of the probability that each such graph occurs. Denote by $W_{empty}(T,I)$
the subset of all the witness graphs that declare all locations in $I$ empty at the prescribed times.
Let $|WG|$ denote the size of the eligible collision set $|\mathcal{L}|$. Clearly

$$w^{(h)}_{UH}(T,I) = \sum_{WG \in W_{empty}(T,I)} \mathrm{Prob}_{UH}\{WG\}$$

and

$$w^{(h)}_{DH}(T,I) = \sum_{WG \in W_{empty}(T,I)} \mathrm{Prob}_{DH}\{WG\}.$$

We  have  proven  in  Lemma  5  that  for  any  witness  graph  LEG,  where  \WG\  =  0(  // 1  /2) , 
ProbWH{WG }  =  Prob'gH{WG}(\  + 

and 

Prot%Hf{WG)  <  Prob^H{WG}{\  +  e~D),  for  i>  >  !*|H'I  +  <>£,„,  +  3|/|  +  D). 


(22) 




Double  hashing  is  computable  and  randomizable  with  universal  hash  functions 


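It follows from (19), (21) and (22) that

$$w^{(h)}_{DH}(T,I) \le \mathrm{Prob}\{|WG| > (c_\alpha|I| + 2\log n)(h+1)!\} + \sum_{\substack{WG \in W_{empty}(T,I) \\ |WG| \le (c_\alpha|I| + 2\log n)(h+1)!}} \mathrm{Prob}_{UH}\{WG\}\left(1 + O\!\left(\frac{|WG|^2}{n}\right)\right)$$

$$\le h\,\frac{(1-\alpha)^{|I|}}{n} + \sum_{\substack{WG \in W_{empty}(T,I) \\ |WG| \le c_\alpha|I|(h+1)!}} \mathrm{Prob}_{UH}\{WG\}\left(1 + O\!\left(\frac{(c_\alpha|I|(h+1)!)^2}{n}\right)\right) + \sum_{\substack{WG \in W_{empty}(T,I) \\ c_\alpha|I|(h+1)! < |WG| \\ |WG| \le (c_\alpha|I| + 2\log n)(h+1)!}} \mathrm{Prob}_{UH}\{WG\} + \sum_{\substack{WG \in W_{empty}(T,I) \\ c_\alpha|I|(h+1)! < |WG| \\ |WG| \le (c_\alpha|I| + 2\log n)(h+1)!}} \mathrm{Prob}_{UH}\{WG\}\,O\!\left(\frac{|WG|^2}{n}\right), \qquad (23)$$

$$\le h\,\frac{(1-\alpha)^{|I|}}{n} + w^{(h)}_{UH}(T,I)\left(1 + O\!\left(\frac{(c_\alpha|I|(h+1)!)^2}{n}\right)\right) + h(1-\alpha)^{|I|}\,O\!\left(\frac{|I|^2((h+1)!)^3}{n}\right),$$

where the $O(\cdot)$ error term comes from (20),

$$\le w^{(h)}_{UH}(T,I)\left(1 + O\!\left(\frac{|I|^2}{n}\right)\right),$$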

which establishes the first part of 1). The second part, $w^{(h)}_{DH_\psi}(T,I) \le w^{(h)}_{DH}(T,I)(1 + e^{-D}) + \mathrm{Prob}\{|\mathcal{L}| > K\}$, for $\psi \ge 3|I| + 7hK + D$, follows directly from Lemma 5.

To  show  that  w  W(r,J)  is  close  to  g(*)(T,/)  in  all  three  models,  we  recall  that 


£  Pro6P“re{M?}  <  q(h\T,I)<  J2  Probw{WG}, 

WGeWempty  ( T,I )  (T,7) 

and  that  u>W(T, /)  —  J2wGeW  t  (T  I)  Pro^  {WG}.  Lemma  5  guarantees  that  each  term  is  close, 

for  witness  sets  where  C  =  O)??1/2),  and  (19)  shows  that  larger  sets  have  a  negligible  probability 

of  occurrence.  Hence  qih)(T,I)  =  wW)(T,  /)(( 1  +  0{\I\2)/n)  in  DH  and  UH.  A  similarly  tight 

inequality  holds  for  DH provided  that  ip  is  sufficiently  large.  The  bounds  on  if)  are  that  ip  > 

(h\W\  +  6£  +  3|/|  +  logra)  from  Lemma  5.  From  (19)  and  (23),  |£|  can  be  restricted  to  be  no 

larger  than  (ca\I\  +  21og  h  +  21og  n){h  +  1)!.  Finally,  ip  must  be  large  enough  that  the  Chernoff- 

Hoeffding  bound  for  fully  random  Bernoulli  Trials  holds  with  limited  independence.  Bound  (19) 

was  attained  by  modeling  independent  probes  as  independent  Bernoulli  trials.  From  Theorem  5  in 
[15], it can be seen that (much more than) sufficient independence is achieved for an independence
ip  —  3|/|  -  h\W\  —  £,.  for  the  maximum  £  value  used  in  (23).  (Alternatively,  Lemma  A2  can  be  used, 
with  an  increase  in  ip  by  factor  of  five,  and  a  nominal  change  in  the  size  of  the  bound.)  Thus, 
it  suffices  to  set  ip  —  7 (h  +  2)\(ca\I\  +  2 log  h  +  2 log  n).  where  we  use  the  fact  that  £  <  \h£\,  and 
as  is  easily  shown,  |LL|  is  unlikely  to  exceed  our  bound  for  £.  The  3|/|-wise  independence  for  the 
dependency  set  is  already  included  in  this  bound  for  ip. 

As  for  the  elusive  riy  and  the  fact  that  the  Bernoulli  trials  are  not  completely  independent,  we 
see  that  the  probability  that  some  set  is  hit,  conditioned  on  the  event  that  some  other  set  is  not 
gives  a  rescaling  by  at  most  As  long  as  an  <  =  n  —  (ca|/|  +  21og/i  +  21ogra)(/i  +  1)!,  our 

results  hold  as  stated.  | 

Recall  that  a  bound  for  the  parameter  |/|  is  achievable  in  terms  of  the  vacancy  estimator  and  Err\ 
in  Corollary  4. 

The  next  Lemma  shows  that  ip  need  only  be  proportional  to  log  n  multiplied  by  a  subexponential 
function  of  h. 

Lemma 7. Given $I$, let $I_0 = h^3[\ln(1/(1-\alpha))|I| + 2\ln n]$. Let $f(h) = e^{2(h+1/2)^{2/3}}$, and take $\psi \ge 7h^2(3f(h)I_0)$.

1)  Let  WG^h-(T,I)  =  (£,W:E),  be  the  witness  graph  for  the  pair  (T,  /),  with  =  an  and  let 
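$\mathcal{N}(j,\alpha n) = E[|\mathcal{L}_j(\alpha n)|]$. Then $\mathcal{N}(j,\alpha n) \le |I| f(j)$ in UH, DH, and $DH_\psi$.

2) Furthermore,

$$\mathrm{Prob}\{|\mathcal{L}| > 3hf(h)I_0\} \le \frac{(1-\alpha)^{|I|}}{n},$$

and

$$\mathrm{Prob}\{|W| > 3hf(h)I_0\} \le \frac{2(1-\alpha)^{|I|}}{n}.$$

3) The vacancy over-estimators $q^{(h)}(T,I)$ (defined in Definition 14) with respective subscripts UH,
DH and $DH_\psi$ are equal, up to a factor of $(1 + O(|I|^2/n))$:

$$q^{(h)}_{UH}(T,I) = q^{(h)}_{DH}(T,I)\left(1 + O\!\left(\frac{|I|^2}{n}\right)\right) = q^{(h)}_{DH_\psi}(T,I)\left(1 + O\!\left(\frac{|I|^2}{n}\right)\right).$$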




Proof:  Equation  (17)  establishes  that  a  suitable  overestimate  for  the  size  of  the  eligible  hit 
set  is  given  by  the  system: 

$$\mathcal{N}(0,0) = |I|, \quad \text{and} \quad \mathcal{N}(j,0) = 0, \quad j = 1, 2, \ldots, h-1.$$

$$\mathcal{N}(j,t) = \mathcal{N}(j,t-1) + \sum_{i<j} (j-i)\,\mathcal{N}(i,t-1)/n_1.$$

It will be convenient to use generating functions and to define a strong inequality $f(x) \preceq g(x)$
to mean that each coefficient in the Taylor expansion of $f(x)$ (about $x = 0$) is positive and at most
the value of the corresponding coefficient for $g$. It will also be convenient, for establishing part 2,
to take the following overestimate, where $I_0 \ge |I|$.

$$\mathcal{N}(j,0) = I_0, \quad j = 0, 1, 2, \ldots, h-1.$$

$$\mathcal{N}(j,t) = \mathcal{N}(j,t-1) + \sum_{i<j} (j-i)\,\mathcal{N}(i,t-1)/n_1,$$

and  to  extend  the  definition  of  Af  to  larger  values  of  j  by  setting 


$$\mathcal{N}(j,0) = I_0, \quad j = 0, 1, 2, \ldots$$

$$\mathcal{N}(j,t) = \mathcal{N}(j,t-1) + \sum_{i<j} (j-i)\,\mathcal{N}(i,t-1)/n_1. \qquad (24)$$


This  latter  modification  does  not  even  affect  the  values  of  Af(j,  0),  for  j  <  h. 

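Let $v(t,x)$ be the generating function $v(t,x) = \sum_j \mathcal{N}(j,t)x^j$. The solution for $v$ is immediate.
The initial condition is $v(0,x) = \sum_j I_0 x^j = \frac{I_0}{1-x}$, and the recurrence equation becomes

$$v(t,x) = v(t-1,x) + \sum_j \sum_{i<j} (j-i)\,\mathcal{N}(i,t-1)\,x^j/n_1$$

$$= v(t-1,x) + \frac{1}{n_1}\sum_j \mathcal{N}(j,t-1)\,x^j \sum_k k\,x^k$$

$$= v(t-1,x) + \frac{v(t-1,x)}{n_1}\sum_k k\,x^k = v(t-1,x)\left(1 + \frac{x}{n_1(1-x)^2}\right).$$

Hence

$$v(t,x) = \left(1 + \frac{x}{n_1(1-x)^2}\right)^t \frac{I_0}{1-x}.$$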
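
The closed form can be cross-checked against the recurrence directly. The sketch below assumes system (24) reads $\mathcal{N}(j,0) = I_0$ and $\mathcal{N}(j,t) = \mathcal{N}(j,t-1) + \sum_{i<j}(j-i)\mathcal{N}(i,t-1)/n_1$, and compares it with the truncated power series of $(1 + x/(n_1(1-x)^2))^t\, I_0/(1-x)$. The function names and the sample parameters are illustrative only.

```python
from fractions import Fraction


def recurrence(i0: int, n1: int, t_max: int, j_max: int) -> list:
    """Iterate N(j,t) = N(j,t-1) + sum_{i<j} (j-i) N(i,t-1) / n1 with N(j,0) = i0."""
    row = [Fraction(i0)] * (j_max + 1)
    for _ in range(t_max):
        row = [row[j] + sum(((j - i) * row[i] for i in range(j)), Fraction(0)) / n1
               for j in range(j_max + 1)]
    return row


def series(i0: int, n1: int, t_max: int, j_max: int) -> list:
    """Coefficients of (1 + x/(n1 (1-x)^2))^t * i0/(1-x), truncated at x^j_max."""
    # x/(1-x)^2 = sum_{k>=1} k x^k, so the factor is 1 + sum_k (k/n1) x^k.
    factor = [Fraction(1)] + [Fraction(k, n1) for k in range(1, j_max + 1)]
    coeffs = [Fraction(i0)] * (j_max + 1)  # i0/(1-x) has all coefficients equal to i0
    for _ in range(t_max):
        coeffs = [sum((factor[k] * coeffs[j - k] for k in range(j + 1)), Fraction(0))
                  for j in range(j_max + 1)]
    return coeffs
```

Exact rational arithmetic is used so that the two computations can be compared for equality rather than approximate agreement.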


Now, $1 + \frac{x}{n_1(1-x)^2} \preceq e^{\frac{x}{n_1(1-x)^2}}$ (since $e^{\frac{x}{n_1(1-x)^2}} = 1 + \frac{x}{n_1(1-x)^2} + \sum_{j>1}\frac{1}{j!}\left(\frac{x}{n_1(1-x)^2}\right)^j$), whence for all $t \ge 0$,
$\left(1 + \frac{x}{n_1(1-x)^2}\right)^t \preceq e^{\frac{tx}{n_1(1-x)^2}}$. Thus $v(t,x) \preceq e^{\frac{tx}{n_1(1-x)^2}}\frac{I_0}{1-x}$, and $\mathcal{N}(j,t)$ is bounded by the $j$th term in
the Taylor expansion of $e^{\frac{tx}{n_1(1-x)^2}}\frac{I_0}{1-x}$. Evidently, $v(t,z)$ is defined for all $|z| < 1$. Consequently, its
Taylor coefficients grow subexponentially. Indeed, any analytic function which has positive Taylor
coefficients and has an infinite subsequence of coefficients $a_j$ as large as $(1/\rho)^j$, for any fixed $\rho < 1$,
cannot be bounded as $x \to \rho$, since the summation will have an infinite number of terms that become
at least as large as 1. Similarly, if the Taylor coefficients of $f$ were bounded by some polynomial
$|a_j| \le cj^k$ for fixed $k$ and $c$, then the $(k+2)$-fold iterated integral of $f$ would have coefficients of size
$O(a_j/j^{k+2}) = O(j^{-2})$, which would render the integrated function convergent on $|z| = 1$. It follows
that the coefficients of $e^{\frac{tz}{n_1(1-z)^2}}$ must be superpolynomial since the function has an essential
singularity (i.e. poles $(1-z)^{-d}$ for unbounded degree $d$) at $z = 1$. A more precise estimation of the
coefficients, with $t = \alpha n = \alpha_1 n_1$, follows.


i'(a  |  u  | ,  !r )  v — v  alx'J 


xk~2j 


=  EE 

i  k>2j 


f  k  \ 


Consequently, 
summation  is  bounded 


<  £Lo(  a’)#-  Now,  <  U  +  1/2 n3/) 

by  yl  (ife('l+1/2)2,3a;/a)  2„i(j+l/2)2/3 

(3f)!  e  i 


^t,  and  (3/)  <  |^7,  so  the 
Setting  ctq  =  1  establishes 


!)■ 


Given a dependency set size $|I|$, let $\mathcal{N}(j,t)$ be defined by (24) with $I_0 = a|I| + b$, where $a$ and $b$
will be specified later. We now show that for suitable $a$ and $b$, $|\mathcal{L}_j(t)|$ is with very high probability
no larger than $3\mathcal{N}(j,t)$.

Formally, we analyze the following modified process: if at any time $\tau$, $|\mathcal{L}_j(\tau)| > (1 + 1/h)^j\mathcal{N}(j,\tau)$,
then the process is aborted, and failure declared. The probabilistic recurrence analogous to (17) is
given below, where $\chi(\text{event})$ denotes the indicator function for the event.

$$|\mathcal{L}_0(\tau)| = |I|, \quad \text{and} \quad |\mathcal{L}_j(0)| = 0, \quad j = 1, 2, \ldots, h-1.$$

$$|\mathcal{L}_j(\tau)| = |\mathcal{L}_j(\tau-1)| + \sum_{i<j} (j-i)\,\chi\big(p(y_\tau, j-i+1) \in \mathcal{L}_i(\tau-1)\big).$$


Expanding the recursion gives:

$$|\mathcal{L}_j(\tau)| = \sum_{i=0}^{j-1} (j-i)\Big[\sum_{\tau' \le \tau} \chi\big(p(y_{\tau'}, j-i+1) \in \mathcal{L}_i(\tau'-1)\big)\Big], \quad \text{for } j > 0, \qquad (25)$$

where the event $\{p(y_{\tau'}, j-i+1) \in \mathcal{L}_i(\tau'-1)\}$ requires that no other probe of $y_{\tau'}$ within the first $h$
hit an eligible collision location.

A similar expansion for $\mathcal{N}$ gives:

$$\mathcal{N}(j,\tau) = \sum_{i=0}^{j-1} (j-i) \sum_{\tau' \le \tau} \mathcal{N}(i,\tau'-1)/n_1 + I_0. \qquad (26)$$

We are now ready to show that in DH, with probability $1 - (\alpha n(h-1))e^{-I_0/h^3}$ or more, $|\mathcal{L}_j(\tau)| \le (1 + 1/h)^j\mathcal{N}(j,\tau)$ for $j = 0, 1, \ldots, h-1$ and $\tau = 1, 2, \ldots, \alpha n$. The bound is clearly true for $j = 0$.
The method of proof is to compute the probability that the bound fails to hold, for each pair $(j,\tau)$,
given that it holds for all smaller $(i,s)$, $i < j$, $s < \tau$.

By  construction,  J2T‘<t  %(p{ Vt1  ,  i  -  j  +  1)  G  £j(T'  -  1)),  is  statistically  dominated  by  the  sum  of 
independent  Bernoulli  trials  X(t' ,j)  with  probability  of  success  equal  to  3  — — — ,  for  r'  —  1, . . . ,  r. 
Let  Xj(t)  —  ^_iX(r,j).  By  assumption, 

T<t  1 

We  now  use  the  following  type  of  Chernoff-Hoeffding  bound,  (proven  in  Lemma  A1  in  the  Appendix), 
to  bound  our  Bernoulli  process: 


Prob{XJ(t)  >  (1  +  1  /h)E[Xj\  +  C}  <  e“i\ 

Let  C  =  2I0/h2,  so  that  with  probability  exceeding  1  -  e-*3",  an  individual  X j(f),  ( 0  <  J  <  i  -  1),  is 
bounded  by  ^r<i(l  +  l/hp+i  XlLLzll  _|_  LhiJ ,  if  the  earlier  A)(r)-s  satisfy  their  respective  bounds 
for  t  <t  —  1.  According  to  the  definitions  of  A,  and  for  i  >  0, 


i-l 


IA'(f)l<  E (i-j)Xj(t), 

j= o 


whence. 


i-l 

ia(*)i<E(*-^ 

j=0 


E(1  +  \/hv"x'J-T  l!  +  — 0 

r<f  ??1 


h2 


i-l 


<  /0  +  (1  +  l/h)1  E (*  -  i) 

J=0 


E 

T<t 


Af{j,T-  1) 


which  by  (26)  is 
$\le (1 + 1/h)^i\,\mathcal{N}(i,t)$.

Hence  Vr  <t,i<  h, 

1 

|£,-(t)|  <  (1  +  with  probability  1  -  (ti)e  . 

2 1 

Now  let  I0  —  ^-[ln(l/(l-a1))|/|  +  31n?t],  so  that  {hn)e~~h^  <  Upon  substituting,  we  see 

that  in  DH  (and  UH ),  the  size  of  the  eligible  collision  set  |£(ct1ra)|  is  bounded  by  3hN(h  -  1 ,  a^re), 
with  probability  1  -  ^1~^'  - . 

The size of the witness set $W$ can be bounded similarly. Let $W_j(\tau)$ be the witness set at time
$\tau$ containing elements that were due to a hit to $\mathcal{L}_j$:

$$|W_j(\tau)| \le \sum_{\tau' \le \tau} \chi\big(p(y_{\tau'}, i) \in \mathcal{L}_j(\tau'-1) \text{ for some probe number } i \in [1, h-j]\big).$$

If  we  are  given  upper  bounds  Rj{r'  -  1)  for  the  size  of  the  Cj{r'  -  1),  then  |  II j ( r)  |  is  bounded  by  the 

sum  of  r  independent  Bernoulli  trials  XT> .  with  probabilities  of  success  pT>  <  [h  j )  U j{  t'  I)///,. 

_  2+ 

We  have  just  shown  that  jT]r,<r(/i.  ~  ~  l)l/Ki  is  with  probability  1  -  (th)e  bounded  by 

J2Ti<T[^Jr^/ h)J  (h-j)\AfJ(T,-l)\/7i1.  The  Chernoff-Hoeffding  bound  from  Lemma  A2  and  the  bound 
for  £j(t)  gives: 


Prob{\Wj(r)\]  >  (1  +  1/h)  ^  (1  +  l/hy(  h  -  j)\Nj{r'  -  l)|/fti  +  <  e  *3°  . 

T'<T 


Hence  W (r)  is  with  probability  1  -2 hre  h3  >  1 


2^  E  bounded  by 


E  ( E(1  +  1W+1(*-i)i^# 

j= 0  V'<r 


l)l/nl  +  2 


<  3A f(h,  t). 


We  again  take  r  =  n ^  and  establish  the  bound  for  the  witness  set  as  stated. 

Part  3)  follows  from  Lemma  6  and  the  bounds  for  \W\  and  |£|.  | 

The only remaining step is to prove that witness graphs give multiplicative vacancy overestimators.
Theorem 3.

1)  1)  is  a  multiplicative  vacancy  overestimator  for  I).  Hence,  for  any 

fixed  a  <  1,  fixed  h ,  and  T  c  [l,cm]  with  \T\  =  O(logn), 


<Aj)h(TJ)={  1  + 


o(  m 


n 


2)  In  DH  and  DH for  >  21/i5e2(fe+1/2)2/35  we  have  the  multiplicative  vacancy  overestimator 

M  *  (21- (i)2)*-^  0(1) 

n  +  ,/h  -\  1(1  -  1)2  '• 


Proof:  Lemmas  5,  6  and  7  establish  that 

$l(T,  I)=(  1  +  y  ProV"H’{WG } 

V  /  WGeWernpty(T,I) 

=  /f  +  2ii^l|  £  Prob£L^lt{WG}x  Prob^hlt{WG}: 

V  n  /WGeWempty(T,I) 

where  Wcrnpty{T .  /)  denotes  the  set  of  all  graphs  of  0(1  +  logn)  keys  that  could  be  the  witness  graph 
for  the  pair  (T,  I),  and  which  declare  all  I  locations  empty.  Prob(j-f^li{WG}  is  the  probability  that 
vertices  included  in  the  witness  graphs  behave  as  prescribed  and  Prob™^hli {WG}  is  the  probability 
that  no  other  vertex  has  an  eligible  hit. 

We  may  partition  each  forest  WG  into  the  \I\  trees  Hhj. . . .  ,  MG|q,  rooted  at  the  respective 
locations  Jl3 . . . ,  i|/|.  ft  is  easy  to  see  that  Prob^~^zi  {WG}  —  (1  +  0(^^\  ))  n^i  Pro&^^fLLGj}, 
where  the  factor  (1  +  0(  11  ))  is  needed  to  account  for  the  conditioning  needed  to  ensure  that  the 

trees  have  disjoint  embeddings.  Lemma  5  remarks  that  in  UH ,  Prob1™^ lul  { M  G}  can  be  expressed 
as  n r:  x'TeD'-w(  l  ~  a*(T 3  WG)),  and  Prob^htt {WGZ}  =  []-:  ucG  -  (t*(t,WGz)).  ft  is  not 

hard  to  verify  that  for  any  xT  e  D'  -  W\  cr*(r,  LLGJ  =  (1  +  )ct**(t,  MGj ) ,  where  a**(r,  MG,,) 

is  the  conditional  probability  that  xT  has  an  eligible  probe  hitting  M  G,,  given  that  its  eligible 
probes  do  not  hit  the  relevant  portions  of  H  Gj .  WG2,  . . . ,  H'G,_|.  Consequently,  1  -  a*(r,  14  G,,)  = 
(1  +  )( 1  -  a**(r,WGj)).  Multiplying  over  r  and  i  gives:

9W(T,/)=  £ 

WGeWempty(T,I) 

(1  +  °(I^GI2)  )Prob£u~^t{WG}  |j  H  ( 1  +  _  a*(Tj  WGi)) 

i=  1  r: 

=  E 

WGeWempty(T,I) 

(1  +  EOEEi)  Pro&^fWGJtl  +  Qd^l^.-D)  Yl  (1  -  <r*(r,  M?$ 

7=1  r:  xyeil'-PF 

e  (i + 0||”'G|2>)  n  pro^MivGi}  n  a  -  v(t,wg,)) 

WGeWempty  (TJ)  i=l  r:  x‘TeD'-Wi 

°(\WG\\WGi\ ))probt-hit{WG 

,}  n  (i-o-(T,wG,n, 


<n  e  (i + 


n 


•=1  WG^WRmpty{%,Ii)  r-.  x‘TeD'-W* 

Evaluating  each  outer  factor  with  size  limits  as  in  (23)  gives 

Pi 


•S]H(T,I)  <  IP1  +  -  lh,  h)i 


(27) 


7  =  1 


where  we  take  u’J^y  ( /,  f)  to  be  1,  for  t  <  1,  and  subtract  |/|  from  Tt  in  (27)  because  D'  comprises 
the  elements  D  —  DT. 

Appealing  to  Lemma  6.2  gives: 


<  a + n  ivkTi  -  i/i), 

i=i 

whence a final simplification shows that

«S,'“hrF)<(i+^!h)n«Sfhr.), 


(28) 


1  =  1 


which  establishes  1). 

Claim  2)  is  a  direct  consequence  of  1),  Lemma  4,  and  Lemma  7. 

To be precise, (28) follows for our specific function •. Modestly annoying combinatorial arguments
would be needed to establish (28) in full generality; we forbear.

Part  3  of  Lemma  7  ensures  that  and  inherit  a  multiplicative  formulation  comparable 

to  that  for  I 
4.  Conclusions

We have shown that in double hashing, a universal family of log n-wise independent hash functions
can  give  nearly  optimal  performance  for  any  fixed  load  bounded  below  1. 
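
To make the setting concrete, here is a minimal sketch of double hashing driven by limited-independence hash functions. It is not the paper's construction: it assumes the textbook degree-$(k-1)$ polynomial family over a prime field as the $k$-wise independent source, a prime table size, and a final reduction modulo the table size (which introduces a small non-uniformity). The names `PolyHash` and `insert_all` are hypothetical.

```python
import random


class PolyHash:
    """Random polynomial of degree k-1 over Z_p (a standard k-wise independent family)."""

    def __init__(self, k: int, p: int, rng: random.Random):
        self.p = p
        self.coeffs = [rng.randrange(p) for _ in range(k)]

    def __call__(self, x: int) -> int:
        acc = 0
        for c in self.coeffs:  # Horner evaluation modulo p
            acc = (acc * x + c) % self.p
        return acc


def insert_all(keys, n, k, p=2_147_483_647, seed=0):
    """Insert distinct keys (each < p) into a table of prime size n by double hashing.

    Returns the table and the number of probes used by each insertion.
    """
    rng = random.Random(seed)
    f, g = PolyHash(k, p, rng), PolyHash(k, p, rng)
    table = [None] * n
    probes = []
    for x in keys:
        start = f(x) % n
        step = 1 + g(x) % (n - 1)  # nonzero step; n prime, so the sequence visits every slot
        for i in range(n):
            slot = (start + i * step) % n
            if table[slot] is None:
                table[slot] = x
                probes.append(i + 1)
                break
    return table, probes
```

For a load $\alpha$, the average of the later entries of `probes` can be compared informally with $1/(1-\alpha)$; the sketch makes no claim that this particular family meets the independence requirements of the theorems above.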

These  results  comprise  a  significant  step  toward  understanding  why  extremely  simple  functions 
seem  to  perform  so  well  when  used  to  double  hash  arbitrary  values  into  a  partially  filled  table. 
Indeed,  it  is  quite  conceivable  that  real  data,  when  hashed  by  such  functions,  might  yield  sequences 
that  exhibit  O(log n)-wise  independence. 

Our  proof  technique  analyzed  local  and  global  hashing  interactions  separately,  and  used  analytic 
tools  to  measure  complicated  but  weakly  correlated  events  in  terms  of  simpler  independent  processes. 
Surely  these  methods  can  be  applied  to  other  probabilistic  processes  that  exhibit  weak  correlations 
and  that  might  be  supported  only  by  a  source  of  limited  randomness. 

5.  Appendix 

This section contains two technical Lemmas, which can simplify large deviation calculations in cases
of full and limited independence. Lemma A2 is a special case of Theorem 6 in [15].


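Lemma A1. Let $X = \sum_{i=1}^{n} X_i$ be the sum of $n$ mutually independent Bernoulli trials $X_1, \ldots, X_n$,
where $\mathrm{Prob}\{X_i = 1\} = p_i$. Then for any $C \ge 0$, and $0 \le \epsilon \le 1$,

$$\mathrm{Prob}\{X > (1+\epsilon)E[X] + C\} \le e^{-1.25\epsilon C}.$$

Proof: Let $p = E[X]/n$. According to Hoeffding [Ho-63]:

$$\mathrm{Prob}\{X \ge (1+\delta)E[X]\} \le \left(\frac{1}{1+\delta}\right)^{(1+\delta)E[X]} \left(\frac{1-p}{1-(1+\delta)p}\right)^{n-(1+\delta)E[X]} \le \left(\frac{1}{1+\delta}\right)^{(1+\delta)E[X]} e^{\delta E[X]}.$$

Let $C = (\delta - \epsilon)E[X]$. It suffices to show that for any $\delta \ge 0$, and $0 \le \epsilon \le 1$,

$$\left(\frac{1}{1+\delta}\right)^{(1+\delta)E[X]} e^{\delta E[X]} \le e^{-1.25\epsilon(\delta-\epsilon)E[X]}.$$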


(For  6  <  1,  a  simple  derivation  gives  the  slightly  better  bound  where  the  1.25,  is  replaced  by  |.) 
We therefore need only show that for all $0 \le \epsilon \le 1$,

$$\left(\frac{1}{1+\delta}\right)^{1+\delta} e^{\delta} \le e^{-1.25\epsilon(\delta-\epsilon)}.$$

Let $g(\delta,\epsilon) = \left(\frac{1}{1+\delta}\right)^{1+\delta} e^{\delta + 1.25\epsilon(\delta-\epsilon)}$. For any fixed $\epsilon$, $g(\delta,\epsilon)$ attains its maximum at $1 + \delta = e^{1.25\epsilon}$.

Therefore $\log g(\delta,\epsilon) \le \log g(e^{1.25\epsilon} - 1, \epsilon) = e^{1.25\epsilon} - 1 - 1.25\epsilon - 1.25\epsilon^2$. For $\epsilon \ge 0$, the expression
$e^{1.25\epsilon} - 1 - 1.25\epsilon - 1.25\epsilon^2$ first decreases and then increases since it is initially decreasing and its
derivative is the sum of a linear function and a function with a rapidly increasing derivative. Hence
$f(\epsilon) = g(e^{1.25\epsilon} - 1, \epsilon)$ first decreases and then increases. It follows that $f(\epsilon) \le \max\{f(0), f(1)\}$, for
$\epsilon \in [0,1]$. Since $\log f(0) = 0$, and calculation shows that $\log f(1) \le -.009$, we see that $g(\delta,\epsilon) \le 1$,
for all $\delta$, and $\epsilon \in [0,1]$. ∎
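
The calculation at the end of this proof is easy to reproduce. The sketch below assumes the reading $\log f(\epsilon) = e^{1.25\epsilon} - 1 - 1.25\epsilon - 1.25\epsilon^2$ and $\log g(\delta,\epsilon) = -(1+\delta)\log(1+\delta) + \delta + 1.25\epsilon(\delta - \epsilon)$; the grid resolution and function names are arbitrary choices for illustration.

```python
import math


def log_f(eps: float) -> float:
    """log f(eps) = e^{1.25 eps} - 1 - 1.25 eps - 1.25 eps^2."""
    a = 1.25 * eps
    return math.expm1(a) - a - 1.25 * eps * eps


def log_g(delta: float, eps: float) -> float:
    """log g(delta, eps) = -(1+delta) log(1+delta) + delta + 1.25 eps (delta - eps)."""
    return -(1 + delta) * math.log1p(delta) + delta + 1.25 * eps * (delta - eps)


grid = [i / 1000 for i in range(1001)]
max_log_f = max(log_f(e) for e in grid)  # largest value of log f on [0, 1]
log_f_at_one = log_f(1.0)                # the endpoint value quoted in the proof
```

A numerical check of this kind does not replace the monotonicity argument, but it confirms that the constant 1.25 is close to the largest value for which the endpoint $\epsilon = 1$ still gives a negative exponent.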


Lemma A2. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ be Bernoulli trials with probabilities of
success $E[X_i] = E[Y_i] = p_i$. Let $X = \sum_{i=1}^{n} X_i$. Suppose that the $Y_i$-s are mutually independent, and
that the $X_i$-s are fully $\psi$-wise independent. Let $I = \{i_1, i_2, \ldots, i_k\}$, and let $\bar{I} = [1,n] - I$. Let
$p(I) = \mathrm{Prob}\{(\bigwedge_{j \in I}(Y_j = 1)) \wedge (\bigwedge_{j \in \bar{I}}(Y_j = 0))\}$, and let $p_\psi(I) = \mathrm{Prob}\{(\bigwedge_{j \in I}(X_j = 1)) \wedge (\bigwedge_{j \in \bar{I}}(X_j = 0))\}$,
so that the subscript $\psi$ indicates that the event is with respect to the fully $\psi$-wise independent
trials $X_1, X_2, \ldots, X_n$.

(1) For $\psi \ge D + k + eE[X] - \log\big(\prod_{j \notin I}(1 - p_j)\big)$ and some $\epsilon$ where $|\epsilon| \le 1$,

$$p_\psi(I) = p(I)(1 + \epsilon e^{-D}).$$

(2) If $\forall i$: $p_i \le 1/2$, then for $\psi \ge D + k + 5E[X]$ and some $\epsilon$ where $|\epsilon| \le 1$,

$$p_\psi(I) = p(I)(1 + \epsilon e^{-D}).$$

Proof: The proof of Lemma A2 is a special case of Theorem 6 in [15]. It is given here for
completeness.

We  may  use  standard  inclusion-exclusion  to  estimate  the  probability  p.^(I)  as  follows. 

p^(/)  =  Pro6^{^A(^  =  l)j  *  ^A(^  =  °)j> 


=  Prob^i  A(^  =  l)lE  E  (“I  yProb^i  A  (E  =  !))■ 

jel  1=0  ik+1<---<ik+til  jefi+1<...<fi+£ 

Truncating  the  outer  summation  at  £  =  tjj  —  k  introduces  an  error  that  is  bounded  by  the  last 

term  of  the  truncated  sum.  Let  p^,{k)  and  pT(k)  denote  these  truncated  sums,  in  the  respective 

cases  of  i/’-wise  and  full  independence.  Since  the  first  if'  -  k  terms  in  the  outer  summation  are 
the same for both fully and $\psi$-wise independent random variables, $p_\psi(k) = p_T(k)$. Furthermore,
Probi>{f\je{i^,.,ik+l}(Xj  =  !))  =  FI )=\Pif  Hence 

p^k) = pT(k )  -  (-i)^-*^[n Pij]  y  n  p>j » 

jel  j=  1 

iji1 

for  some  6^  e  [0, 1],  and  an  identical  inequality  holds  without  the  ip  subscripts.  Hence 

i pi> ( j)  -  p^) i  <  tn  Pij ]  y  n  p^  ■ 

jel  i ; <...<iu  ,k  j  1 

ip-k 

y  n  Pi.  is  maximized  when  all  pt.  are  equal  and  therefore  the  error  \p^,{k)-p(k)\  is  bounded 
*1  <---<i%p_k  J  =  1 

by 

in  *,](*,  -  *)  (Pl+Klr+i>,)M  s 

To  get  multiplicative  error  bounds,  we  need  that  (E [X])^~k / (ip  -  k)\  <  e~D  Y\jgi{  1  ~Pj  )■  Setting 
if>-k=  eE[X]  -  log  (n^/(l  -  Pj))  +  £>  ,  gives: 

ip-k 


(E[V :})*-kM-k)\< 


eE[-Y]\ 
ip-k  ) 


ip-k 


<  1  + 


log  (rijv/i1  -Pj))  +  D' 
+  ip  -  k 


< e  n*1  p.i]- 


3$I 


This  proves  1.  The  second  inequality  follows  immediately  by  observing  that  if,  say,  Vi:  pi  <  then 

-  log  (n  u  -  pj) ]  =YYP-i^Y2pjY  {-^tl  =  -Y  2pj  1os  \  <  !l-l!Kl-vi-  ■ 

\jgl  /  j$Ik>  0  jgl  k>  0  j 